Colorectal Cancer (CRC) is the second most common cancer in women (614,000 cases per year) and the third most common in men(746,000 cases per year). Currently it is the most common malignant cancer in the gatrointestinal tract, representing 13% of all malignant tumors, and it is considered the main cause of death in gastrointestinal cancer Rebecca L. Siegel MPH (2015). It can be arised from one or a combination of three differen mechanisms, namely chromosmomal inestability (CIN), CpG islands methylator phenotype (CIMP), and microsatellite inestability (MSI). Understanding the specific mechanisms of tumorigenesis and the underlaying genetic and epigenetic traits is crutial in the disease phenotype comprehension. The Cancer Genome Atlas (TCGA) has comprehensively profiled this type of cancer in a patient cohort. Here we analyze the expression profiles of those patients, accessible in the form of a raw RNA-seq counts produced by Rahman et al. (2015) using a pipeline based on the R/Bioconductor software package Rsubread.
We start importing the raw table of counts.
coadse <- readRDS(file.path("rawCounts", "seCOAD.rds"))
summary(coadse$type)
normal tumor
41 483
dim(colData(coadse))
[1] 524 549
In this TCGA RNA-seq data we have a total of 524 samples (483 tumor samples and 41 non-tumor samples) that have 549 clinical variables/factors that could be analysed. However, we are not going to explore all these variables, but those that can have more relevance in the data of study.
In order to know more about the data, we explore the colData column, that corresponds to clinical variables, and their corresponding metadata.
colData(coadse)[1:5, 1:5]
DataFrame with 5 rows and 5 columns
type bcr_patient_uuid
<factor> <factor>
TCGA.3L.AA1B.01A.11R.A37K.07 tumor A94E1279-A975-480A-93E9-7B1FF05CBCBF
TCGA.4N.A93T.01A.11R.A37K.07 tumor 92554413-9EBC-4354-8E1B-9682F3A031D9
TCGA.4T.AA8H.01A.11R.A41B.07 tumor A5E14ADD-1552-4606-9FFE-3A03BCF76640
TCGA.5M.AAT4.01A.11R.A41B.07 tumor NA
TCGA.5M.AAT5.01A.21R.A41B.07 tumor NA
bcr_patient_barcode form_completion_date
<factor> <factor>
TCGA.3L.AA1B.01A.11R.A37K.07 TCGA-3L-AA1B 2014-4-22
TCGA.4N.A93T.01A.11R.A37K.07 TCGA-4N-A93T 2014-10-1
TCGA.4T.AA8H.01A.11R.A41B.07 TCGA-4T-AA8H 2014-6-5
TCGA.5M.AAT4.01A.11R.A41B.07 NA NA
TCGA.5M.AAT5.01A.21R.A41B.07 NA NA
prospective_collection
<factor>
TCGA.3L.AA1B.01A.11R.A37K.07 YES
TCGA.4N.A93T.01A.11R.A37K.07 YES
TCGA.4T.AA8H.01A.11R.A41B.07 NO
TCGA.5M.AAT4.01A.11R.A41B.07 NA
TCGA.5M.AAT5.01A.21R.A41B.07 NA
mcols(colData(coadse), use.names=TRUE)
DataFrame with 549 rows and 2 columns
labelDescription
<character>
type sample type (tumor/normal)
bcr_patient_uuid bcr patient uuid
bcr_patient_barcode bcr patient barcode
form_completion_date form completion date
prospective_collection tissue prospective collection indicator
... ...
lymph_nodes_pelvic_pos_total total pelv lnp
lymph_nodes_aortic_examined_count total aor lnr
lymph_nodes_aortic_pos_by_he aln pos light micro
lymph_nodes_aortic_pos_by_ihc aln pos ihc
lymph_nodes_aortic_pos_total total aor-lnp
CDEID
<character>
type NA
bcr_patient_uuid NA
bcr_patient_barcode 2673794
form_completion_date NA
prospective_collection 3088492
... ...
lymph_nodes_pelvic_pos_total 3151828
lymph_nodes_aortic_examined_count 3104460
lymph_nodes_aortic_pos_by_he 3151832
lymph_nodes_aortic_pos_by_ihc 3151831
lymph_nodes_aortic_pos_total 3151827
These metadata consists of two columns of information about the clinical variables.
One called labelDescription contains a succint description of the variable, often
not more self-explanatory than the variable name itself, and the other called
‘CDEID’ corresponds to the so-called Common Data Element (CDE) identifier. This
identifier can be use in https://cdebrowser.nci.nih.gov to search for further
information about the associated clinical variable using the Advanced search
form and the Public ID attribute search.
Now, explore the row (feature) data.
rowData(coadse)
DataFrame with 20115 rows and 3 columns
symbol txlen txgc
<character> <integer> <numeric>
1 A1BG 3322 0.564419024683925
2 A2M 4844 0.488232865400495
9 NAT1 2280 0.394298245614035
10 NAT2 1322 0.389561270801815
12 SERPINA3 3067 0.524942940984676
... ... ... ...
100996331 POTEB 1706 0.430832356389215
101340251 SNORD124 104 0.490384615384615
101340252 SNORD121B 81 0.407407407407407
102724473 GAGE10 538 0.505576208178439
103091865 BRWD1-IT2 1028 0.592412451361868
rowRanges(coadse)
GRanges object with 20115 ranges and 3 metadata columns:
seqnames ranges strand | symbol txlen
<Rle> <IRanges> <Rle> | <character> <integer>
1 chr19 58345178-58362751 - | A1BG 3322
2 chr12 9067664-9116229 - | A2M 4844
9 chr8 18170477-18223689 + | NAT1 2280
10 chr8 18391245-18401218 + | NAT2 1322
12 chr14 94592058-94624646 + | SERPINA3 3067
... ... ... ... . ... ...
100996331 chr15 20835372-21877298 - | POTEB 1706
101340251 chr17 40027542-40027645 - | SNORD124 104
101340252 chr9 33934296-33934376 - | SNORD121B 81
102724473 chrX 49303669-49319844 + | GAGE10 538
103091865 chr21 39313935-39314962 + | BRWD1-IT2 1028
txgc
<numeric>
1 0.564419024683925
2 0.488232865400495
9 0.394298245614035
10 0.389561270801815
12 0.524942940984676
... ...
100996331 0.430832356389215
101340251 0.490384615384615
101340252 0.407407407407407
102724473 0.505576208178439
103091865 0.592412451361868
-------
seqinfo: 455 sequences (1 circular) from hg38 genome
By using the rowData and rowRanges functions we can see information about the
features of interest. In this case each row represents a gene transcript and we
can see information about the length, the GC content, the chromosome where it is
located, the strand (forward or reverse), etc.
To perform quality assessment and normalization we need first to load the
edgeR R/Bioconductor package,
create a DGEList object and store it in the results folder.
dge <- DGEList(counts=assays(coadse)$counts, genes=mcols(coadse))
saveRDS(dge, file.path("results", "dge.rds"))
Since gene expression data contain a big amount of variability, it is a good practice to stabilize variability by transforming the expression values into logarithmic scale. In order to do so, we calculate \(\log_2\) CPM values of expression and put them as an additional assay element to ease their manipulation.
assays(coadse)$logCPM <- cpm(dge, log=TRUE, prior.count=0.5)
assays(coadse)$logCPM[1:5, 1:5]
TCGA.3L.AA1B.01A.11R.A37K.07 TCGA.4N.A93T.01A.11R.A37K.07
1 0.2611757 3.518905
2 10.0130074 6.668572
9 -5.8591298 -5.859130
10 -5.8591298 -5.859130
12 5.7145152 5.917077
TCGA.4T.AA8H.01A.11R.A41B.07 TCGA.5M.AAT4.01A.11R.A41B.07
1 -0.593583 -1.566157
2 6.471984 7.316527
9 -5.859130 -5.859130
10 -5.859130 -5.859130
12 5.505588 5.475205
TCGA.5M.AAT5.01A.21R.A41B.07
1 -0.09524941
2 7.24810042
9 -5.85912981
10 -4.41522112
12 5.44298099
As we can see, for each sample there are different CPM (Counts Per Million) values of expression. Negative ones correspond to very low expressed samples.
From the total of 524 samples, before starting to analyse the data we proceed to do a subset of paired samples (tumor and non-tumor samples extracted from the same patient). The reason of doing this is to reduce the within variance, as far as variation between different individuals could be found due to environmental factors. In order to obtain a better estimate of these expression level differences between Tumor and Non-tumor samples, we searched in the TCGA barcode those patients that were present in both Tumor and Non-tumor samples. This meant that those samples belonged to the same patient and, thus, were paired.
df <- data.frame(rbind(table(coadse$bcr_patient_barcode, coadse$type)))
df_paired<-df[(df$normal > 0) & (df$tumor > 0),]
paired_mask <-coadse$bcr_patient_barcode %in% rownames(df_paired)
coadse.paired<- coadse[,paired_mask]
dge.paired <- dge[,paired_mask]
table(coadse.paired$type)
normal tumor
38 38
As is shown in this table, we finally found a total of 38 non-tumor samples and 38 tumor paired samples. However, we consider that with this number of samples should be sufficient powered to detect expression changes between tumor and non-tumor samples.
Before proceeding with any normalization or analysis step, we want to collect an overview of the data that we are working with. Here we examine the sequencing depth by plotting the total number of reads mapped to the genome per sample.
Figure 1: Library sizes in increasing order
In the plot 1 we observe two main aspects that are worth paying attention to. First, we observe that in general the tumor samples present a lower sequencing depth than the normal samples. Even though we could not find any reliable explaination for this behaviour so far in our research, we considered this event dispensable at this time as the sequencing depth will be taken into consideration during the normalization steps. One other important aspect is that we observe that some sample present considerable differences in the sequencing depth and specifically we can distinguish a couple of samples on the left of the graph that present considerably low sequencing depths: this may generate problems in further analysis of the data. For this reason we identify these samples by looking at the at the sample depth as shown below.
sampledepth <- round(dge.paired$sample$lib.size / 1e6, digits=1)
names(sampledepth) <- substr(colnames(coadse.paired), 6, 12)
sort(sampledepth)
A6.2679 A6.2682 AA.3531 AA.3517 AA.3534 A6.2683 AA.3518 AA.3520 AA.3522
5.9 9.0 13.6 14.7 15.4 15.8 16.5 16.8 17.0
AA.3516 AA.3525 AA.3514 A6.2680 AA.3527 A6.2685 A6.2671 A6.2678 AA.3496
17.2 17.5 17.7 18.1 19.0 19.2 21.1 21.1 25.3
AA.3697 AA.3660 AZ.6598 AA.3663 A6.2675 AA.3663 AZ.6600 AA.3662 AA.3662
25.6 28.4 30.2 30.4 31.0 31.1 31.2 31.5 32.5
AA.3713 A6.5662 A6.2675 AA.3655 AA.3660 AZ.6599 AA.3712 AZ.6605 AA.3489
32.7 33.0 33.4 33.5 33.7 34.2 34.5 34.7 34.7
AZ.6601 AZ.6603 AA.3525 AA.3655 A6.5667 AA.3534 A6.2686 AA.3511 A6.5667
35.1 35.3 36.8 37.2 37.3 37.7 37.8 37.8 38.2
AZ.6601 A6.2679 AZ.6598 A6.5662 AA.3496 AA.3511 AA.3712 AZ.6605 AA.3518
38.6 39.1 39.2 39.5 39.8 40.3 40.4 40.7 40.8
AZ.6599 AZ.6600 AA.3713 AA.3697 AA.3514 AZ.6603 AA.3489 A6.2682 A6.2678
40.9 41.9 42.0 44.4 44.9 45.0 45.2 45.8 46.5
A6.2671 F4.6704 AA.3516 A6.2685 A6.2683 A6.2686 A6.2680 AA.3531 AA.3522
47.3 47.3 47.7 48.6 48.8 49.9 53.3 53.8 55.6
AA.3517 AA.3527 AA.3520 F4.6704
56.0 57.1 57.8 67.5
Because the samples A6.2679 and A6.2682 present extremely low sequencing depths we decide to remove them from our dataset. Since we decided to proceed with a paired analysis, when removing one sample we remove its paired one too.
# Remove the two samples with very low sequencing depth
mask_remove_low_coverage <- substr(colnames(coadse.paired),9,12) %in% c("2679", "2682")
coadse.paired <- coadse.paired[,!mask_remove_low_coverage]
dge.paired <- dge.paired[,!mask_remove_low_coverage]
summary(coadse.paired$type)
normal tumor
36 36
We verify that after the filtering of the samples with low library size, our paired dataset contains 36 samples.
# coverage_filtered plot
par(mfrow = c(1, 1), mar = c(4, 5, 1, 1))
ord <- order(dge.paired$sample$lib.size)
barplot(dge.paired$sample$lib.size[ord]/1e+06, las = 1, ylab = "Millions of reads", xlab = "Samples",
col = c("cyan", "orange")[coadse.paired$type[ord]] )
legend("topleft", c("Normal", "Tumor"), fill = c("cyan", "orange"), inset = 0.01)
Figure 2: Library sizes in increasing order after filtering
Moreover, in figure 2 we can observe again the sorted histogram of library sizes per sample after the filtering.
In order to collect an even broader understanding of the dataset we then furher explore the sequencing depth highlighting female and male samples.
ord <- order(dge.paired$sample$lib.size)
barplot(dge.paired$sample$lib.size[ord]/1e+06, las = 1, ylab = "Millions of reads", xlab = "Samples", col = c("cyan", "orange")[coadse.paired$gender[ord]] )
legend("topleft", c("Female", "Male"), fill = c("cyan", "orange"), inset = 0.01)
Figure 3: Library sizes with respect to gender
In figure 3 we can see that not only we have a similar number of female and male samples, but that their sequencing depth is quite well equilibrated as we cannot clearly identify any defined clustering event.
Even after the filtering out of the samples with low library sizes values, some samples still present different sequencing depth and also there may be sample-specific biase related to sample preparation. For this reason, we need to carry out a normalization procedure.
assays(coadse.paired)$logCPM <- cpm(dge.paired, log = TRUE, prior.count = 0.25)
We are then interested in exploring the distribution of the expression levels through all the samples in terms of logarithmic CPM units.
Figure 4: Non-parametric density distribution of expression profiles per sample
In figure 4 we can observe a non-parametric density distribution of expression profiles per sample in terms of logarithmic units. As expected the expression values per sample follow a bimodal distribution with an early peak that corresponds to lowly-expressed genes and a later peak that correponds to highly-expressed ones. In order to provide a clearer visualization, we decided to divide the dataset into two smaller subsets, specifically tumor samples and normal samples.
Figure 5: Non-parametric density distribution of expression profiles per sample
Figure 5 shows the expression levels in terms of logCPM for tumors and control samples separately. We can not appreciate any substancial difference between the distributions of the tumor and normal samples.
We are now interested in investigating if there is any gene which has very low expression values so that we can exclude them from our dataset. In fact, if a gene is not expressed, then it can not be differentially expressed and so including them in our analysis would only add noise.
Figure 6: Distribution of average expression level per gene
In order to do so, we calculate the average expression of each gene for all the samples and plot their distribution in Figure 6.
From the distribution of figure 6 we could already visually identify a cutoff to define the too lowly expressed genes. Nontheless, we opted for a more precise approach to calculate the cutoff.
cpmcutoff <- round(10/min(dge.paired$sample$lib.size/1e+06), digits = 1)
sprintf("Cutoff: %s", cpmcutoff)
[1] "Cutoff: 0.7"
nsamplescutoff <- min(table(coadse.paired$gender))
mask <- rowSums(cpm(dge.paired) > cpmcutoff) >= nsamplescutoff
coadse.filt <- coadse.paired[mask, ]
dge.filt <- dge.paired[mask, ]
Once we have identified the cutoff, we filter out the genes that are below it.
par(mar = c(4, 5, 1, 1))
h <- hist(avgexp, xlab = expression("Expression level (" * log[2] * "CPM)"), main = "",
las = 1, col = "grey", cex.axis = 1.2, cex.lab = 1.5)
x <- cut(rowMeans(assays(coadse.filt)$logCPM), breaks = h$breaks)
lines(h$mids, table(x), type = "h", lwd = 10, lend = 1, col = "darkred")
legend("topright", c("All genes", "Filtered genes"), fill = c("grey", "darkred"))
Figure 7: Distribution of average expression level per gene highlighting the filtering
We can visually observe which genes have been left out from the datatset in Figure 7. Before proceeding with the normalization steps, we prefer to save a copy of the unnormalized data in case we later need to work on it again.
saveRDS(coadse.filt, file.path("results", "coadse.filt.unnorm.rds"))
saveRDS(dge.filt, file.path("results", "dge.filt.unnorm.rds"))
Since different samples may have different RNA compositions and this could be problematic for further analyses we need to take it into account and normalize the data. We estimate a normalization factor for each library using the Trimmed Mean of M-Values and apply it to our data.
dge.filt <- calcNormFactors(dge.filt)
assays(coadse.filt)$logCPM <- cpm(dge.filt, log=TRUE, normalized.lib.sizes=TRUE, prior.count=0.25)
Again, we like to keep on with the good practice of storing intermediate generated files.
saveRDS(coadse.filt, file.path("results", "coadse.filt.rds"))
saveRDS(dge.filt, file.path("results", "dge.filt.rds"))
Figure 8: MA-plots of the tumor samples
In Figure 8 we observe the MA-plots for the Tumor samples subset. We can see that even after the between and within normalization steps, we still have some artifacts in our data as we can observe for example in 8 B, E, F, J, Z, AI, and AD.
Figure 9: MA-plots of the control samples
In Figure 9 we observe the MA-plots for the Control samples subset and we cannot identify any important expression levels biases in any of the normal samples.
The next step will be the search for potential surrogate of batch effect indicators. Batch effect include sub-groups of measurements that have qualitatively different behaviour across conditions and are unrelated to the biological variables in the study. As normalization of the data can not always remove completely the batch effect, we are going to identify this effect, adjust for it and eventually correct it.
Given that each sample names corresponds to a [TCGA barcode] (https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/), following the strategy described in http://bioinformatics.mdanderson.org/main/TCGABatchEffects:Overview we are going to derive different elements of the TCGA barcode and examine their distribution across samples.
tss <- substr(colnames(dge.filt), 6, 7)
table(data.frame(TYPE=coadse.filt$type, TSS=tss))
TSS
TYPE A6 AA AZ F4
normal 9 20 6 1
tumor 9 20 6 1
samplevial <- substr(colnames(dge.filt), 14, 16)
table(data.frame(TYPE=coadse.filt$type, SAMPLEVIAL=samplevial))
SAMPLEVIAL
TYPE 01A 11A
normal 0 36
tumor 36 0
portionanalyte <- substr(colnames(dge.filt), 18, 20)
table(data.frame(TYPE=coadse.filt$type, PORTIONANALYTE=portionanalyte))
PORTIONANALYTE
TYPE 01R 02R 11R 21R
normal 35 1 0 0
tumor 17 6 7 6
plate <- substr(colnames(dge.filt), 22, 25)
table(data.frame(TYPE=coadse.filt$type, PLATE=plate))
PLATE
TYPE 0821 0826 1410 1653 1672 1723 1774 1839 A32Z
normal 0 0 0 1 1 9 4 6 15
tumor 9 3 3 1 0 9 4 6 1
center <- substr(colnames(dge.filt), 27, 28)
table(data.frame(TYPE=coadse.filt$type, CENTER=center))
CENTER
TYPE 07
normal 36
tumor 36
From this information we can make the following observations:
TSS: Samples were collected across 4 different tissue source sites (TSS), with poor representation of one of them (F4).
Sample Vial: 36 samples belong to solid tumor samples(01 code) whereas 36 do not (11 code).
Portions & Analyte: samples where distributed in different portions and analytes combinations.
Plates: Samples were sequenced in 9 different plates.
Center: All samples were sequenced at the same center.
The analyte code of all samples is a R (RNA samples), as expected.
Since all the sample were analysed in the same center, that is for sure not a cause of batch effect. Moreover, the cross-tabulation tables of sample vial, portion analytes and plates contain too many zero entries which implies that we do not have enough information identify any batch effect with this information. On the other hand, TSS presents values that allow further investigation to detect batch effect.
logCPM <- cpm(dge.paired, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(tss))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(coadse.filt)
outcome <- paste(substr(colnames(coadse.filt), 9, 12), as.character(coadse.filt$type), sep="-")
names(outcome) <- colnames(coadse.filt)
sampleDendrogram <- dendrapply(sampleDendrogram,
function(x, batch, labels) {
if (is.leaf(x)) {
attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
attr(x, "label") <- as.vector(labels[attr(x, "label")])
}
x
}, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(tss))), fill=sort(unique(batch)))
Figure 10: Hierarchical clustering of the samples
We examine now how samples group together by hierarchical clustering and multidimensional scaling, annotating the outcome of interest and the the surrogate of batch indicator. We calculate again log CPM values with a higher prior count to moderate extreme fold-changes produced by low counts. The resulting dendrogram is shown in Figure 10.
We can observe that samples cluster primarily by sample type, tumor or normal, which is a good sign for our analysis since normal and tumor are our main outcomes of interest.
plotMDS(dge.paired, labels=outcome, col=batch)
legend("bottomleft", paste("Batch", sort(unique(batch)), levels(factor(tss))),
fill=sort(unique(batch)), inset=0.05)
Figure 11: Multidimensional scaling plot of the samples
In Figure 11 we show the corresponding MDS plot. Here we see more clearly that the first source of variation separates tumor from normal samples.
From both the hierarchical plot and de MDS plot we can not identify any batch effect for the TSS. For this reason, we proceed the analysis without adjusting or removing any batch effect.
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] it_IT.UTF-8/it_IT.UTF-8/it_IT.UTF-8/C/it_IT.UTF-8/it_IT.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] geneplotter_1.60.0 annotate_1.60.1
[3] XML_3.98-1.19 AnnotationDbi_1.44.0
[5] lattice_0.20-38 edgeR_3.24.3
[7] limma_3.38.3 SummarizedExperiment_1.12.0
[9] DelayedArray_0.8.0 BiocParallel_1.16.6
[11] matrixStats_0.54.0 Biobase_2.42.0
[13] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2
[15] IRanges_2.16.0 S4Vectors_0.20.1
[17] BiocGenerics_0.28.0 knitr_1.22
[19] BiocStyle_2.10.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 RColorBrewer_1.1-2 compiler_3.5.3
[4] BiocManager_1.30.4 highr_0.8 XVector_0.22.0
[7] bitops_1.0-6 tools_3.5.3 zlibbioc_1.28.0
[10] bit_1.1-14 digest_0.6.18 memoise_1.1.0
[13] RSQLite_2.1.1 evaluate_0.13 Matrix_1.2-17
[16] DBI_1.0.0 yaml_2.2.0 xfun_0.6
[19] GenomeInfoDbData_1.2.0 stringr_1.4.0 bit64_0.9-7
[22] locfit_1.5-9.1 grid_3.5.3 rmarkdown_1.12
[25] bookdown_0.9 blob_1.1.1 magrittr_1.5
[28] codetools_0.2-16 htmltools_0.3.6 xtable_1.8-4
[31] KernSmooth_2.23-15 stringi_1.4.3 RCurl_1.95-4.12
After the data filtering and normalization, we are interested in identifying gene expression changes and associated p-values between tumor and control samples. For this task, we used the R/Bioconductor packages sva and limma. We are adopting linear regression models, which in general are used in order to represent, in the most accurate way possible, something which is very complex. In the specific case of DE analysis, the model is used as a predictor of gene expression values. For the purpose of fitting the said regression model, it is crucial to build a design matrix. The design matrix is a n x m matrix, where n is the number of samples and m is the number of coeffients which need to be estimated or the types of samples, like cases and controls. It needs to be defined for which factors to adjust. In our case we are working with a paired design and we would have liked to adjust for gender.
We decide to proceed by only adjusting for the patient id, as it is already implicitly adjusting for gender; as we are working with a paired design, gender is identical between every pair of samples, which would make reduntand to adjust for it and would create technical problems during the inversion step of the matrix in the least square algorithm steps.
In the follwing step we create the design matrix, only adjusting for the patient id. We verify as well wether the matrix is full rank and so is invertible with no problem, which is the case.
pid <- substr(colnames(dge.filt), 9, 12)
mod <- model.matrix(~ factor(type) + factor(pid), colData(coadse.filt))
mod0 <- model.matrix(~ factor(pid), colData(coadse.filt))
#Verify wether is full rank --> yes!
is.fullrank(mod)
[1] TRUE
We use the sva package for identifying and removing batch effects and other unwanted sources of variation.
sv <- sva(assays(coadse.filt)$logCPM, mod, mod0)
Number of significant surrogate variables is: 8
Iteration (out of 5 ):1 2 3 4 5
sv$n
[1] 8
The SVA algorithm has found 8 surrogate variables which we add to our model.
modsv <- cbind(mod, sv$sv)
mod0sv <- cbind(mod0, sv$sv)
In order to compare gene expression values between samples we need to adjust for the mean-variance realtionship.
barplot(sort(dge.filt$samples$lib.size)/1e+06, xlab = "Samples", ylab = "Library sizes (Millions)", col=rgb(0.7, 0.1, 0.3))
Figure 12: Library sizes among samples
This step could be executed with both limma trend and limma voom. We decided to proceed with the voom() function, since as we can observe in the Figure 12, there are big differences in library sizes among the samples.
So, we calculate the weights that estimate the mean-variance realtionship at gene-by-sample level with the voom() function. The obtained weights will be used to fit the model. Morevoer, at the same time, it performs a between-sample normalization.
v <- voom(dge.filt, modsv, plot=TRUE, normalize.method="quantile")
Figure 13: Mean-variance trend
Next, we fit a linear model with the fit() function.
# Fit linear model for each gene given a series of arrays
fit<- lmFit(v,modsv)
We then calculate the moderated t-statistics through the eBayes() function.
# compute moderated t-statistics, moderated F-statistic, and log-odds of differential expression by empirical Bayes moderation of the standard errors towards a common value.
fit <- eBayes(fit)
FDRcutoff <- 0.01
Next, we classified the obtained test statistics as up, down or not significant with the help of the decideTest function. The FDR threshold used for this test is corresponds to 1%, as in cancer data is usual to have a lot of changes and so we prefer to be strict.
#Classify a series of related t-statistics as up, down or not significant.
res <- decideTests(fit, p.value = FDRcutoff)
summary(res)
(Intercept) factor(type)tumor factor(pid)2675 factor(pid)2678
Down 97 4224 1 3
NotSig 1760 3684 12965 12963
Up 11110 5059 1 1
factor(pid)2680 factor(pid)2683 factor(pid)2685 factor(pid)2686
Down 1 1 3 1
NotSig 12966 12965 12963 12966
Up 0 1 1 0
factor(pid)3489 factor(pid)3496 factor(pid)3511 factor(pid)3514
Down 5 1 0 2
NotSig 12959 12964 12967 12962
Up 3 2 0 3
factor(pid)3516 factor(pid)3517 factor(pid)3518 factor(pid)3520
Down 0 0 0 1
NotSig 12966 12967 12967 12965
Up 1 0 0 1
factor(pid)3522 factor(pid)3525 factor(pid)3527 factor(pid)3531
Down 0 0 1 1
NotSig 12965 12965 12966 12965
Up 2 2 0 1
factor(pid)3534 factor(pid)3655 factor(pid)3660 factor(pid)3662
Down 0 0 3 2
NotSig 12966 12965 12960 12962
Up 1 2 4 3
factor(pid)3663 factor(pid)3697 factor(pid)3712 factor(pid)3713
Down 3 1 0 1
NotSig 12964 12966 12966 12964
Up 0 0 1 2
factor(pid)5662 factor(pid)5667 factor(pid)6598 factor(pid)6599
Down 4 1 0 1
NotSig 12962 12966 12967 12966
Up 1 0 0 0
factor(pid)6600 factor(pid)6601 factor(pid)6603 factor(pid)6605
Down 2 1 5 0
NotSig 12965 12965 12962 12967
Up 0 1 0 0
factor(pid)6704
Down 1 1928 1251 1430 420 1085 95 7 3
NotSig 12966 9327 10863 10207 12144 10780 12742 12959 12964
Up 0 1712 853 1330 403 1102 130 1 0
We extract the genes chromosome to look at the distribution of the differential expressed genes among the chromosomes, again for the purpose of getting an overview of the data we will be working with.
genesmd <- data.frame(chr = as.character(sub("\\_.*", "", seqnames(rowRanges(coadse.filt)))), symbol = rowData(coadse.filt)[, 1], stringsAsFactors = FALSE)
#Extract a table of the top-ranked genes from a linear model fit.
fit$genes <- genesmd
tt <- topTable(fit, coef = 2, n = Inf)
genes<-table(as.character(sub("\\chr*", "",tt$chr[tt$adj.P.Val < FDRcutoff])))
par(las=2) # make label text perpendicular to axis
par(mar=c(5,8,4,2)) # increase y-axis margin.
names.arg=c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","X","Y","Un")
barplot(genes, cex.names=0.7, names.arg = names.arg, col=rgb(0.7, 0.1, 0.3))
Figure 14: Distribution of DE genes among chromosomes
sort(table(tt$chr[tt$adj.P.Val < FDRcutoff]), decreasing = TRUE)
chr1 chr19 chr2 chr11 chr3 chr17 chr12 chr6 chr7 chr5 chr9 chrX
948 676 613 612 534 529 489 430 424 420 405 362
chr16 chr4 chr10 chr14 chr15 chr8 chr20 chr22 chr13 chr18 chr21 chrY
357 347 344 341 276 276 263 221 162 124 114 15
chrUn
1
Now, we produce two diagnostic plots for limma DE analysis.
par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(tt$P.Value, xlab = "Raw P-values", main = "A)", las = 1, col=rgb(0.7, 0.1, 0.3))
qqt(fit$t[, 2], df = fit$df.prior + fit$df.residual, main = "B)", pch = ".", cex = 3)
abline(0, 1, lwd = 2, col= "red")
Figure 15: Diagnostic plots for limma DE analysis
In the A plot of figure 15 we visualize the raw p-values distribution: it is mainly uniform a part from the very high peak on the left which corresponds to the significantly DE genes. In the B Q-Q plot of figure 15 we observe that our gene expression values are not following a normal distribution. In fact, if that was the case, we would expect the two lines overlapping and no DE genes.
One interesting aspect we would like to systematically assess is the accuracy of the model with different combinations of adjusting factors and when using different algorithms when adjusting for the mean variance relationship. For doing this, we would need to use a dataset of genes which are known to be differentially expressed between patients with colon cancer and control individuals with no colon cancer. At the moment, we could not find a suitable dataset for this task but in the future, to expand the analysis, it would be interesting to investigate this aspect further. Lastly, we just collect our set of differential expressed genes to proceed with further analyses.
We proceed to filter only those genes with FDR < 1% and a logFC > 2 (a minimum 100% change in expression). We then separately observe the upregulated genes in tumor (logFC>0) and the downregulated ones (logFC<0). We decide to investigate them by sorting them by logFC.
tt.filt <- tt[tt$adj.P.Val < FDRcutoff,] #filtering by FDR < 0.01
tt.filt <- tt.filt[abs(tt.filt$logFC)>2,] # filtering by logFC > 2
all_DEgenes <- tt.filt
up_regulated <- all_DEgenes[which(all_DEgenes$logFC > 0),] # DE genes up regulated
down_regulated <- all_DEgenes[which(all_DEgenes$logFC < 0),] # DE genes down regulated
length(rownames(all_DEgenes)) # how many DE genes we have
[1] 895
length(rownames(up_regulated)) # Up-reg DE
[1] 371
length(rownames(down_regulated)) # Down-reg DE
[1] 524
down_regulated <- down_regulated[order(down_regulated$logFC),]
up_regulated <- up_regulated[order(up_regulated$logFC, decreasing = TRUE),]
print(head(down_regulated$symbol, n = 8))
[1] "PPP1R16B" "CA11" "LVRN" "CLCN1" "WNT10A" "HDAC2"
[7] "OPRK1" "SCPEP1"
print(head(up_regulated$symbol, n = 8))
[1] "NT5DC2" "COL13A1" "COPA" "CDH6" "FZD9" "LCE2A" "ESRP2"
[8] "DPH1"
After the filtering, the total number of DE genes is 895, from which 371 are up regulated and 524 are down regulated.
tt.filt <- tt[tt$adj.P.Val < FDRcutoff,] #filtering by FDR < 0.01
all_DEgenes <- tt.filt[abs(tt.filt$logFC)>2,] # filtering by logFC > 2
DEgenes <- rownames(all_DEgenes)
limma::plotMA(fit, coef = 2, status = rownames(fit$lods) %in% DEgenes, legend = FALSE,
main = "MA plot", hl.pch = 46, hl.cex = 2, bg.pch = 46, bg.cex = 2, las = 1)
legend("bottomright", c("DE genes", "Not DE genes"), fill = c("red", "black"), inset = 0.01)
Figure 16: MA plot of DE genes with logFC > 2 and FDR cutoff < 0
01.
saveRDS(all_DEgenes , file.path("results", "alldegenes.rds"))
In figure 16 we can see the distribution of the filtered DE genes. One reassuring thing here is that the dots are centered around the value zero which indicates that there are not expression-level dependent biases.
Lastly, we want to visualize the DE genes in a Volcano plot. We set two different thresholds: the log fold change threshold and the adjusted p-value threshold.
EnhancedVolcano(tt,
lab = tt$symbol,
x = 'logFC',
y = 'adj.P.Val',
xlim = c(-8, 8),
title = 'Volcano Plot',
pCutoff = 10e-16,
FCcutoff = 2,
transcriptPointSize = 1.0,
transcriptLabSize = 3.0)
Warning: Removed 1 rows containing missing values (geom_point).
Warning: Removed 1 rows containing missing values (geom_text).
Figure 17: Volcano plot
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] it_IT.UTF-8/it_IT.UTF-8/it_IT.UTF-8/C/it_IT.UTF-8/it_IT.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GSVA_1.30.0 GSVAdata_1.18.0
[3] hgu95a.db_3.2.3 GSEABase_1.44.0
[5] org.Hs.eg.db_3.7.0 xtable_1.8-4
[7] GOstats_2.48.0 graph_1.60.0
[9] Category_2.48.1 Matrix_1.2-17
[11] EnhancedVolcano_1.0.1 ggrepel_0.8.1
[13] ggplot2_3.1.1 calibrate_1.7.2
[15] MASS_7.3-51.4 sva_3.30.1
[17] genefilter_1.64.0 mgcv_1.8-28
[19] nlme_3.1-139 geneplotter_1.60.0
[21] annotate_1.60.1 XML_3.98-1.19
[23] AnnotationDbi_1.44.0 lattice_0.20-38
[25] edgeR_3.24.3 limma_3.38.3
[27] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
[29] BiocParallel_1.16.6 matrixStats_0.54.0
[31] Biobase_2.42.0 GenomicRanges_1.34.0
[33] GenomeInfoDb_1.18.2 IRanges_2.16.0
[35] S4Vectors_0.20.1 BiocGenerics_0.28.0
[37] knitr_1.22 BiocStyle_2.10.0
loaded via a namespace (and not attached):
[1] bitops_1.0-6 bit64_0.9-7 RColorBrewer_1.1-2
[4] Rgraphviz_2.26.0 tools_3.5.3 R6_2.4.0
[7] DBI_1.0.0 lazyeval_0.2.2 colorspace_1.4-1
[10] withr_2.1.2 tidyselect_0.2.5 bit_1.1-14
[13] compiler_3.5.3 labeling_0.3 bookdown_0.9
[16] scales_1.0.0 RBGL_1.58.2 stringr_1.4.0
[19] digest_0.6.18 rmarkdown_1.12 AnnotationForge_1.24.0
[22] XVector_0.22.0 pkgconfig_2.0.2 htmltools_0.3.6
[25] highr_0.8 rlang_0.3.4 RSQLite_2.1.1
[28] shiny_1.3.2 dplyr_0.8.1 RCurl_1.95-4.12
[31] magrittr_1.5 GO.db_3.7.0 GenomeInfoDbData_1.2.0
[34] Rcpp_1.0.1 munsell_0.5.0 stringi_1.4.3
[37] yaml_2.2.0 zlibbioc_1.28.0 plyr_1.8.4
[40] grid_3.5.3 blob_1.1.1 promises_1.0.1
[43] crayon_1.3.4 splines_3.5.3 locfit_1.5-9.1
[46] pillar_1.3.1 codetools_0.2-16 glue_1.3.1
[49] evaluate_0.13 BiocManager_1.30.4 httpuv_1.5.1
[52] gtable_0.3.0 purrr_0.3.2 assertthat_0.2.1
[55] xfun_0.6 mime_0.6 later_0.8.0
[58] survival_2.44-1.1 tibble_2.1.1 shinythemes_1.1.2
[61] memoise_1.1.0
In many differential gene expression analyses, the magnitude of gene expression changes may be small and very few significant DE genes will be idenified after having adjusted for multiple testing. For this reason, we want to see if there are small but consistent changes occuring for a number of genes operating in a common pathway. We go through this workflow by assessing DE genes direclty at the geneset level, starting from the expression data themselves. We approach this analysis by using the GSEA algorithm.
To start with the GSEA, we start putting our expression data and the collection of gene sets from the GSVA[http://www.bioconductor.org/packages/release/data/experiment/html/GSVAdata.html] dataset c2BroadSets in a GeneSetCollection object, as it contains curated gene sets, suggesting a good set representation for biological data analysis.
# collect gene sets
data(c2BroadSets)
gsc <- GeneSetCollection(c2BroadSets)
gsc
GeneSetCollection
names: NAKAMURA_CANCER_MICROENVIRONMENT_UP, NAKAMURA_CANCER_MICROENVIRONMENT_DN, ..., ST_PHOSPHOINOSITIDE_3_KINASE_PATHWAY (3272 total)
unique identifiers: 5167, 100288400, ..., 57191 (29340 total)
types in collection:
geneIdType: EntrezIdentifier (1 total)
collectionType: BroadCollection (1 total)
length(gsc) # number of gene sets in the collection
[1] 3272
head(names(gsc)) # some of the gene sets in the collection
[1] "NAKAMURA_CANCER_MICROENVIRONMENT_UP"
[2] "NAKAMURA_CANCER_MICROENVIRONMENT_DN"
[3] "WEST_ADRENOCORTICAL_TUMOR_MARKERS_UP"
[4] "WEST_ADRENOCORTICAL_TUMOR_MARKERS_DN"
[5] "WINTER_HYPOXIA_UP"
[6] "WINTER_HYPOXIA_DN"
In order to reduce the amount of data to analyse, we are going to restrict the analysis to the pathways KEGG, REACTOME and BIOCARTA.
gsc <- gsc[c(grep("^KEGG", names(gsc)),
grep("^REACTOME", names(gsc)), grep("^BIOCARTA", names(gsc)))]
gsc
GeneSetCollection
names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., BIOCARTA_ACTINY_PATHWAY (833 total)
unique identifiers: 55902, 2645, ..., 8544 (6744 total)
types in collection:
geneIdType: EntrezIdentifier (1 total)
collectionType: BroadCollection (1 total)
length(gsc)
[1] 833
Now we can start our GSEA analysis, using the algorithm refered as simple GSEA (Irizarry et al. (2009)[https://journals.sagepub.com/doi/abs/10.1177/0962280209351908]).
First, we need to map the identifiers from the gene sets to the identifiers of the data we are going to analyze. Furthermore, we create an incidence matrix (Im) which indicates which genes belong to what gene set.
gsc <- mapIdentifiers(gsc, AnnoOrEntrezIdentifier(metadata(coadse.filt)$annotation))
gsc
GeneSetCollection
names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., BIOCARTA_ACTINY_PATHWAY (833 total)
unique identifiers: 55902, 2645, ..., 8544 (6744 total)
types in collection:
geneIdType: EntrezIdentifier (1 total)
collectionType: BroadCollection (1 total)
Im <- incidence(gsc)
dim(Im)
[1] 833 6744
Im[1:2, 1:10]
55902 2645 5232 5230 5162 5160 5161 55276
KEGG_GLYCOLYSIS_GLUCONEOGENESIS 1 1 1 1 1 1 1 1
KEGG_CITRATE_CYCLE_TCA_CYCLE 0 0 0 0 1 1 1 0
7167 84532
KEGG_GLYCOLYSIS_GLUCONEOGENESIS 1 1
KEGG_CITRATE_CYCLE_TCA_CYCLE 0 0
We can see that several genes are present in more than one gene set.
Once done it, the following step will be to discard those genes that are not present in our data, and also discard all genes in our data that are not annotated in any gene sets.
Im <- Im[, colnames(Im) %in% rownames(all_DEgenes)]
dim(Im)
[1] 833 312
coadse.filt <- coadse.filt[colnames(Im), ]
dim(coadse.filt)
[1] 312 72
dge.filt <- dge.filt[colnames(Im),]
dim(dge.filt)
[1] 312 72
As a result we have discarted almost 6400 genes that didn’t belong to our data.
In order to determine whether an a priori defined set of genes shows statistically significant differences between tumor vs normal phenotypes, we are going to perform a Gene Set Enrichment Analysis to introduce a method for pathway analysis assessing DE directly at gene set level.
The main scope of the GSEA approach is, as mentioned, to detect consistent and robust changes in expression within a gene set. To do so, we calculate the Enrichment Score (ES) as a z-score.
We decided to filter out the genesets that contain less than 10 genes. In fact, very small gene sets may induce little reliability and increase type I errors (false positives). Then we also calculate the z-score.
Im <- Im[rowSums(Im) >= 5, ]
tGSgenes <- all_DEgenes[match(colnames(Im), rownames(all_DEgenes)), "t"] # calculate all t-statistics for genes in each gene set
zS <- sqrt(rowSums(Im)) * (as.vector(Im %*% tGSgenes)/rowSums(Im)) # calculating Zscore statistic
length(zS)
[1] 98
head(zS, n=10)
KEGG_GLYCOLYSIS_GLUCONEOGENESIS
-14.054563
KEGG_FATTY_ACID_METABOLISM
-22.468084
KEGG_STEROID_HORMONE_BIOSYNTHESIS
-3.128532
KEGG_PURINE_METABOLISM
-18.179222
KEGG_PYRIMIDINE_METABOLISM
-39.806624
KEGG_ARGININE_AND_PROLINE_METABOLISM
8.772841
KEGG_RETINOL_METABOLISM
3.420898
KEGG_PORPHYRIN_AND_CHLOROPHYLL_METABOLISM
17.655980
KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P450
-16.873131
KEGG_DRUG_METABOLISM_CYTOCHROME_P450
3.158353
The incidence matrix after the filtering has lost 200 gene sets approximately. As we can see, know we have 618 pathways that contain a total of 4174 genes.
qqnorm(zS)
abline(0,1)
Figure 18: Q-Q Plot of gene set Z-scores
As we can observe in the Q-Q plot of figure 18 we can detect that the gene sets Z-scores do not follow a normal distribution (expected but the null hypothesis of no DE gene sets), which indicates a promising number of DE gene sets.
Just to have an overview we rank the gene sets according to Z-scores, and show in decreasing order.
rnkGS_Z <- sort(abs(zS), decreasing = TRUE)
head(rnkGS_Z, n=20)
REACTOME_SIGNALLING_BY_NGF
43.99818
KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION
42.76569
KEGG_PYRIMIDINE_METABOLISM
39.80662
KEGG_OLFACTORY_TRANSDUCTION
37.54749
REACTOME_P75_NTR_RECEPTOR_MEDIATED_SIGNALLING
33.75918
REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS
33.30035
REACTOME_G1_S_TRANSITION
33.27730
KEGG_CALCIUM_SIGNALING_PATHWAY
33.20118
REACTOME_DIABETES_PATHWAYS
33.04180
KEGG_WNT_SIGNALING_PATHWAY
32.83759
REACTOME_DOWNSTREAM_EVENTS_IN_GPCR_SIGNALING
32.35510
KEGG_NOTCH_SIGNALING_PATHWAY
32.23747
KEGG_TIGHT_JUNCTION
32.13952
REACTOME_CELL_CELL_ADHESION_SYSTEMS
32.06619
REACTOME_FORMATION_OF_PLATELET_PLUG
31.95290
KEGG_DRUG_METABOLISM_OTHER_ENZYMES
31.52333
KEGG_LONG_TERM_POTENTIATION
31.51740
REACTOME_HEMOSTASIS
31.50406
REACTOME_SIGNALING_IN_IMMUNE_SYSTEM
30.68054
REACTOME_OLFACTORY_SIGNALING_PATHWAY
30.49469
Let’s define a function called plotGS to produce a scatter plot, for a given gene set, of the mean expression values comparing Tumor vs Normal phenotypes.
plotGS <- function(se, gs, pheno,dge, ...) {
l <- levels(colData(se)[, pheno])
idxSamples1 <- colData(se)[, pheno] == l[1]
idxSamples2 <- colData(se)[, pheno] == l[2]
exps1 <- rowMeans(assays(se)$logCPM[gs, idxSamples1])
exps2 <- rowMeans(assays(se)$logCPM[gs, idxSamples2])
rng <- range(c(exps1, exps2))
plot(exps1, exps2, pch = 21, col = "black", bg = "black", xlim = rng, ylim = rng,
xlab = l[1], ylab = l[2], ...)
abline(a = 0, b = 1, lwd = 2, col = "red")
nn<-dge[rownames(dge$genes) %in% gs,]$genes$symbol
text(exps1, exps2, labels = nn, cex = 0.8, pos=2)
}
We perform one sample z-test and compute the number of DE gene sets according to the adjusted p-value with and FDR of 1%.
pv <- pmin(pnorm(zS), 1 - pnorm(zS))
sum(pv < 0.05)
[1] 97
pvadj <- p.adjust(pv, method = "fdr")
DEgs <- names(pvadj)[which(pvadj < 0.01)]
length(DEgs)
[1] 94
head(DEgs, n = 30)
[1] "KEGG_GLYCOLYSIS_GLUCONEOGENESIS"
[2] "KEGG_FATTY_ACID_METABOLISM"
[3] "KEGG_STEROID_HORMONE_BIOSYNTHESIS"
[4] "KEGG_PURINE_METABOLISM"
[5] "KEGG_PYRIMIDINE_METABOLISM"
[6] "KEGG_ARGININE_AND_PROLINE_METABOLISM"
[7] "KEGG_RETINOL_METABOLISM"
[8] "KEGG_PORPHYRIN_AND_CHLOROPHYLL_METABOLISM"
[9] "KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P450"
[10] "KEGG_DRUG_METABOLISM_CYTOCHROME_P450"
[11] "KEGG_DRUG_METABOLISM_OTHER_ENZYMES"
[12] "KEGG_ABC_TRANSPORTERS"
[13] "KEGG_MAPK_SIGNALING_PATHWAY"
[14] "KEGG_CALCIUM_SIGNALING_PATHWAY"
[15] "KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION"
[16] "KEGG_CHEMOKINE_SIGNALING_PATHWAY"
[17] "KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION"
[18] "KEGG_CELL_CYCLE"
[19] "KEGG_OOCYTE_MEIOSIS"
[20] "KEGG_UBIQUITIN_MEDIATED_PROTEOLYSIS"
[21] "KEGG_LYSOSOME"
[22] "KEGG_ENDOCYTOSIS"
[23] "KEGG_APOPTOSIS"
[24] "KEGG_WNT_SIGNALING_PATHWAY"
[25] "KEGG_NOTCH_SIGNALING_PATHWAY"
[26] "KEGG_HEDGEHOG_SIGNALING_PATHWAY"
[27] "KEGG_TGF_BETA_SIGNALING_PATHWAY"
[28] "KEGG_AXON_GUIDANCE"
[29] "KEGG_FOCAL_ADHESION"
[30] "KEGG_CELL_ADHESION_MOLECULES_CAMS"
After filtering by adjusted p value we obtained a total of 94 gene sets.
Now, we plot the mean expression values per gene for 6 gene sets of interest selected by the Z-score.
top1G <- colnames(Im)[which(Im[names(rnkGS_Z)[1], ] == 1)]
genesGS1 <- rowData(coadse.filt)[top1G,1] # to know the gene symbol
genesGS1
[1] "ADCY5" "PDPK1" "RASGRF2" "GSK3A" "ADCYAP1R1" "HDAC2"
[7] "PSEN1" "RAPGEF1" "ITGB3BP" "MEF2A" "RTN4" "PCSK6"
top2G <- colnames(Im)[which(Im[names(rnkGS_Z)[2], ] == 1)]
genesGS2 <- rowData(coadse.filt)[top2G,1]
genesGS2
[1] "CHRM1" "GRIN2D" "HTR4" "PTAFR" "PTGER4" "AGTR2"
[7] "GALR1" "ADCYAP1R1" "OPRK1" "GABRA4" "GPR50" "TAAR8"
[13] "GRM4"
top10G <- colnames(Im)[which(Im[names(rnkGS_Z)[10], ] == 1)]
genesGS10 <- rowData(coadse.filt)[top10G,1]
genesGS10
[1] "PPP3R2" "PPP3CC" "CHP1" "CREBBP" "EP300" "BTRC" "PPP2R5E"
[8] "FZD6" "FZD7" "FZD9" "AXIN2" "WNT16" "WNT10A" "PSEN1"
top15G <- colnames(Im)[which(Im[names(rnkGS_Z)[15], ] == 1)]
genesGS15 <- rowData(coadse.filt)[top15G,1]
genesGS15
[1] "PDPK1" "RASGRP2" "PDGFA" "PDGFB" "PSAP" "DAGLA" "HRG"
top37G <- colnames(Im)[which(Im[names(rnkGS_Z)[37], ] == 1)]
genesGS37 <- rowData(coadse.filt)[top37G,1]
genesGS37
[1] "PSMC6" "PSMB7" "PSMD5" "CLASP1" "NUP85" "CENPA" "ITGB3BP"
[8] "KIF20A" "NSL1" "CLIP1"
par(mfrow = c(3, 2), mar= c(2,5,2,2), mai=c(0.5,0.6,0.6,1))
plotGS(coadse.filt, top1G, "type", main = paste("1)",names(rnkGS_Z)[1]),dge.filt, cex.main = 2, cex.lab = 2, las = 1)
plotGS(coadse.filt, top2G, "type", main = paste("2)",names(rnkGS_Z)[2]),dge.filt, cex.main = 2, cex.lab = 2, las = 1)
plotGS(coadse.filt, top10G, "type", main = paste("3)", names(rnkGS_Z)[10]), dge.filt, cex.main = 2, cex.lab = 2, las = 1)
plotGS(coadse.filt, top15G, "type", main = paste("4)",names(rnkGS_Z)[15]),dge.filt, cex.main = 2, cex.lab = 2, las = 1)
plotGS(coadse.filt, top37G, "type", main = paste("5)",names(rnkGS_Z)[37]),dge.filt, cex.main = 2, cex.lab = 2, las = 1)
Figure 19: Scatter Plots of mean expression values for genesets of interests
We can observe that in the selected genesets the distribution of the genes is mainly downregulated in tumor cases respect to normal. In general, gene downregulation in cancer could possibly be the cause of dysregulation of important pathways, repression of apoptosis or tumor suppressors’ repression.
Now we want to investigate the specific genes that are up- and down-regulated.
In the plot 1 of figure 19, we can observe an overexpression of the MEF2A transcription factor myocyte-enhancer factor 2 known to play a role in adaptive responses during development and adult life. Even if the specific role in tumorgenesis of the MEF2 family has not been clarified yet, it has been identified that it can favor matrix degradative processes when its activation is promoted by TGF-Beta by decreasing the stability of HDACs Di Giorgio, Hancock, and Brancolini (2018). Consistently with this scenario we observe an underexpression of the HDAC2 gene, which competes for binding to the same region of MEF2.
In the plot 2 of figure 19 we observe an overexpression of the HT receptor (HTR4) which is involved in the neuronal response with the serotonergic synapse pathway Arese et al. (2018). HT receptors have been shown to be over expressed in cancer tissues and that their antagonists inhibit the HT effect to different extents and induce apoptosis Radin and Patel (2017).
In plot 3 of figure 19 we can identify an overexpression of two Frizzled Receptors(FZD7 and FZD9). FZD9 expression was reported to be upregulated in different carcinomas Qiu et al. (2016). Moreover, a dysregulation of the WNT pathway by FZD7 was related to tumorigenesis and metastasis. As mentioned, a possible disregulation of the WNT pathway is induced by the Frizzled Receptors; consistently to this we observe that expression levels of WNT10a and WNT6 are affected as well in tumor patients. A desregulation in Wnt pathway has been associated with the accumulation of the oncogenic protein beta-catein, and therefore, with the development of cancer Fearon (2011).
In plot 4 of figure 19 we identify an overexpression of the PDGF-B gene.In a mice study was identified that PDGF-B released from colon tumor cells regulated tumor growth by inducing blood vessel formation. Also they found that an elevated expression of PDGF-B was also correlated with tumor size Hsu, Huang, and Friedman (1995).
In plot 5 of figure 19 we identify an overexpression of the CENPA gene. Recent work has demonstrated that the kinetochore protein CENP-A was overexpressed in all of 11 primary human colorectal cancer tissues. It is also known that chromosome missegregation during mitosis is the main cause of aneuploidy and contributes to oncogenesis. Centromere protein (CENP)-A is the centromere-specific histone-H3-like variant essential for centromere structure, function and the assembly of the kinetochore Tomonaga et al. (2003).
boxplotgenes <- function(se, gene) {
iterations = dim(se)[2]
variables = 2
output <- matrix(ncol=variables, nrow=iterations)
output <- data.frame(output)
colnames(output) <- c("type", "logCPM")
aa <- se[rowData(se)$symbol == gene]$type
bb<-assays(se[rowData(se)$symbol == gene])$logCPM
for(i in 1:iterations){
output$type[i] <- aa[i]
output$logCPM[i] <- bb[i]
}
output$type<-gsub(x = output$type, pattern = "1", replacement = "normal")
output$type<-gsub(x = output$type, pattern = "2", replacement = "tumor")
boxplot(logCPM ~ type, data=output, col=c("grey",rgb(0.7, 0.1, 0.3)), main=gene, ylab="logCPM")
}
par(mfrow = c(3, 3), mar= c(2,5,2,2), mai=c(0.5,0.6,0.6,1))
boxplotgenes(coadse.filt, "MEF2A")
boxplotgenes(coadse.filt, "HDAC2")
boxplotgenes(coadse.filt, "HTR4")
boxplotgenes(coadse.filt, "FZD7")
boxplotgenes(coadse.filt, "FZD9")
boxplotgenes(coadse.filt, "PDGFB")
boxplotgenes(coadse.filt, "CENPA")
Figure 20: Boxplot of logCPM expression of genes of interest between tumor and normal samples
In Figure 20 you can observe the boxplots for the logCPM expression values of the genes we just discussed between tumor and normal samples.
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] it_IT.UTF-8/it_IT.UTF-8/it_IT.UTF-8/C/it_IT.UTF-8/it_IT.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] ggplot2_3.1.1 calibrate_1.7.2
[3] MASS_7.3-51.4 GSVA_1.30.0
[5] GSVAdata_1.18.0 hgu95a.db_3.2.3
[7] GSEABase_1.44.0 org.Hs.eg.db_3.7.0
[9] xtable_1.8-4 GOstats_2.48.0
[11] graph_1.60.0 Category_2.48.1
[13] Matrix_1.2-17 geneplotter_1.60.0
[15] annotate_1.60.1 XML_3.98-1.19
[17] AnnotationDbi_1.44.0 lattice_0.20-38
[19] edgeR_3.24.3 limma_3.38.3
[21] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
[23] BiocParallel_1.16.6 matrixStats_0.54.0
[25] Biobase_2.42.0 GenomicRanges_1.34.0
[27] GenomeInfoDb_1.18.2 IRanges_2.16.0
[29] S4Vectors_0.20.1 BiocGenerics_0.28.0
[31] knitr_1.22 BiocStyle_2.10.0
loaded via a namespace (and not attached):
[1] bitops_1.0-6 bit64_0.9-7 RColorBrewer_1.1-2
[4] Rgraphviz_2.26.0 tools_3.5.3 R6_2.4.0
[7] DBI_1.0.0 lazyeval_0.2.2 colorspace_1.4-1
[10] withr_2.1.2 tidyselect_0.2.5 bit_1.1-14
[13] compiler_3.5.3 bookdown_0.9 scales_1.0.0
[16] genefilter_1.64.0 RBGL_1.58.2 stringr_1.4.0
[19] digest_0.6.18 rmarkdown_1.12 AnnotationForge_1.24.0
[22] XVector_0.22.0 pkgconfig_2.0.2 htmltools_0.3.6
[25] highr_0.8 rlang_0.3.4 RSQLite_2.1.1
[28] shiny_1.3.2 dplyr_0.8.1 RCurl_1.95-4.12
[31] magrittr_1.5 GO.db_3.7.0 GenomeInfoDbData_1.2.0
[34] Rcpp_1.0.1 munsell_0.5.0 stringi_1.4.3
[37] yaml_2.2.0 zlibbioc_1.28.0 plyr_1.8.4
[40] grid_3.5.3 blob_1.1.1 promises_1.0.1
[43] crayon_1.3.4 splines_3.5.3 locfit_1.5-9.1
[46] pillar_1.3.1 codetools_0.2-16 glue_1.3.1
[49] evaluate_0.13 BiocManager_1.30.4 httpuv_1.5.1
[52] gtable_0.3.0 purrr_0.3.2 assertthat_0.2.1
[55] xfun_0.6 mime_0.6 later_0.8.0
[58] survival_2.44-1.1 tibble_2.1.1 shinythemes_1.1.2
[61] memoise_1.1.0
The Quality assessment for tumor data will be performed following the steps and the reasoning used in the Quality assesment for the paired data. For this reason, here we are only going to highlight the strategies that differ from the previous pipeline.
From the total of samples before starting to analyse the data, we proceed to do a subset that only includes the tumor data. The reason of doing this is because we are interested also in the differential expression analysis between the different stages of the Colon Adenocarcinoma Cancer.
In order to create the subset, we looked for those samples that had “tumor” as their type and discarded those with “normal” type. We also discarded those samples that had no information about the stage of the tumor.
Although the data is categorized in diverse subgroups, we will work with the following ones: stage I, stage II, stage III and stage IV, being each one the sum of all the subcategories that share the same stage number:
filter_tumor_mask <- rownames(colData(coadse)[colData(coadse)$type == "tumor",])
coadse.tumor <- coadse[, filter_tumor_mask]
coadse.tumor <- coadse.tumor[, !is.na(coadse.tumor$ajcc_pathologic_tumor_stage)]
coadse.tumor <- coadse.tumor[, coadse.tumor$ajcc_pathologic_tumor_stage != "[Not Available]"]
coadse.tumor$ajcc_pathologic_tumor_stage <- gsub(x = coadse.tumor$ajcc_pathologic_tumor_stage, pattern = "A", replacement = "")
coadse.tumor$ajcc_pathologic_tumor_stage <- gsub(x = coadse.tumor$ajcc_pathologic_tumor_stage, pattern = "B", replacement = "")
coadse.tumor$ajcc_pathologic_tumor_stage <- gsub(x = coadse.tumor$ajcc_pathologic_tumor_stage, pattern = "C", replacement = "")
table(coadse.tumor$ajcc_pathologic_tumor_stage)
Stage I Stage II Stage III Stage IV
72 168 124 63
First of all, we want to collect an overview of the data that we are working with. Starting examining the sequencing depth by plotting the total number of reads mapped to the genome per sample.
CM.4748 AA.3696 AA.3837 AA.A02H AA.A010 AA.A00E A6.2679 AA.3672 AA.3524
1.2 3.0 4.5 5.1 5.5 5.7 5.9 6.6 6.8
AA.A01K AA.A004 AA.3947 AA.A01C CA.6717 D5.6536 AA.3976 AA.3506 AA.3841
7.3 7.4 7.6 8.0 8.0 8.1 8.3 8.6 8.6
A6.2682 AA.A03F AA.3667 AA.3664 AA.A029 AA.3519 AA.3812 AA.3949 AA.3956
9.0 9.6 9.9 10.2 10.2 11.0 11.0 11.2 11.3
AA.3842 AA.3848 AA.3875 AA.3930 AA.3955 AA.A01Q A6.3807 AA.3862 AA.3679
11.5 11.5 12.1 12.1 12.1 12.2 12.3 12.3 12.5
A6.A5ZU AA.3684 AA.3542 AA.3850 AA.A00F A6.A56B AA.3560 AA.3680 AA.3715
12.6 12.7 12.8 12.8 13.0 13.1 13.2 13.2 13.2
AA.A01F AA.A01T CM.4747 AA.3681 AZ.4313 AA.3495 AA.3531 AA.3521 AA.A01Z
13.4 13.4 13.4 13.5 13.5 13.6 13.6 13.7 13.7
AA.3861 AA.3502 AA.3970 AA.A00N AA.3950 AA.A024 AA.3666 AA.3530 AA.3710
13.9 14.1 14.1 14.2 14.3 14.3 14.5 14.6 14.6
AA.3517 AA.3821 AA.3972 AA.A00D AA.3971 AA.A00L A6.3808 AA.3544 AA.3548
14.7 14.7 14.7 14.7 14.9 14.9 15.0 15.0 15.0
AA.3845 AA.3975 AA.3852 AA.A01G AZ.4315 AA.3552 AA.3939 AA.3982 AA.3815
15.0 15.0 15.1 15.1 15.1 15.2 15.2 15.2 15.3
AA.A00R A6.A567 AA.3534 AA.3941 AA.3968 AA.3846 AA.3867 AA.3973 AA.A00K
15.3 15.4 15.4 15.4 15.4 15.5 15.5 15.6 15.7
CM.6169 A6.2683 AA.A00J AA.A00U AA.A02E DM.A1DB AA.3811 AA.A00W AA.A03J
15.7 15.8 15.8 15.9 15.9 15.9 16.0 16.0 16.0
AA.3833 AA.A017 AA.A00Q CM.4752 AA.A02F AA.3518 AA.3844 AA.3529 AA.3980
16.2 16.2 16.3 16.3 16.4 16.5 16.5 16.7 16.7
AA.3520 AY.4070 AA.3522 AA.3554 AA.3860 AA.3543 AA.A00O AZ.4308 A6.2676
16.8 16.8 17.0 17.0 17.0 17.1 17.1 17.1 17.2
AA.3516 AA.3989 AA.3858 AA.A00Z AY.4071 AA.3488 AA.3561 AA.3877 AA.A01V
17.2 17.2 17.3 17.3 17.3 17.4 17.4 17.4 17.4
AA.3525 AA.3492 AA.3851 AA.A00A AA.3514 AA.3549 AA.3866 AA.3984 AA.3986
17.5 17.6 17.6 17.6 17.7 17.7 17.7 17.7 17.7
AA.3994 AA.A022 AA.3870 AZ.4684 AA.3864 AA.3979 AA.A01X A6.2680 AA.3538
17.7 17.7 17.9 17.9 18.0 18.0 18.0 18.1 18.2
AA.3556 AA.3695 AA.3814 AA.3509 AA.A01S CM.4746 AA.3869 AA.A01P AA.A02W
18.2 18.2 18.3 18.4 18.5 18.5 18.7 18.7 18.7
AA.A02R CA.5256 G4.6304 AA.3510 AA.3527 AA.3952 AA.A01D AZ.4681 CM.5341
18.8 18.8 18.8 19.0 19.0 19.0 19.0 19.0 19.0
AA.3553 A6.2685 AA.3562 AA.3855 CM.4750 AA.A01R AZ.4614 AA.3831 AA.A02J
19.1 19.2 19.2 19.2 19.2 19.4 19.6 19.7 19.7
AA.A02O AA.3819 CK.4951 A6.4107 AA.3494 AA.3532 AA.3966 AA.3688 NH.A6GC
19.8 19.9 19.9 20.0 20.0 20.1 20.6 20.7 20.7
AA.A01I AZ.4615 A6.2671 A6.2678 AA.3818 A6.2681 AA.3678 AA.3555 AA.3854
20.9 21.0 21.1 21.1 21.2 21.4 21.4 21.5 21.6
4T.AA8H G4.6310 AA.3692 D5.6920 AA.3856 AA.3872 AA.3673 AA.3693 F4.6856
21.7 21.8 21.9 21.9 22.4 22.7 23.0 23.3 23.8
G4.6306 F4.6806 D5.6924 CM.6677 CM.5861 AA.3697 CM.5864 D5.6533 CA.5796
23.8 24.3 24.5 25.1 25.4 25.6 25.8 26.3 26.6
CM.5862 F4.6855 NH.A6GB G4.6323 A6.6141 CM.6166 F4.6808 G4.6320 AY.6196
26.8 26.9 27.0 27.4 28.0 28.0 28.2 28.3 29.0
NH.A8F8 AZ.5407 A6.A566 D5.6929 DM.A288 CM.5860 D5.6927 CM.5349 G4.6309
29.0 29.1 29.2 29.3 29.4 29.8 29.8 29.9 29.9
D5.5540 AZ.6598 CM.4743 F4.6805 CK.5912 G4.6298 D5.6930 G4.6315 A6.2675
30.0 30.2 30.2 30.3 30.4 30.6 30.8 30.8 31.0
AA.3663 A6.6142 CM.6165 G4.6311 G4.6295 F4.6460 F4.6461 CM.6674 CK.5916
31.1 31.3 31.5 31.5 31.6 31.8 31.8 31.9 32.3
G4.6321 AA.3662 DM.A280 G4.6293 CM.6167 G4.6588 AU.3779 CA.5255 QL.A97D
32.4 32.5 32.5 32.5 32.6 32.6 32.7 32.9 33.0
A6.4105 AD.5900 AZ.4323 AZ.6607 D5.6530 F4.6463 AA.3655 D5.5537 F4.6807
33.2 33.2 33.2 33.2 33.2 33.2 33.5 33.5 33.5
AA.3660 A6.5666 G4.6302 D5.6535 AZ.6599 RU.A8FL CK.5915 CK.6751 CM.6172
33.7 33.8 33.9 34.0 34.2 34.2 34.3 34.5 34.5
4N.A93T AY.A8YK G4.6627 3L.AA1B AM.5821 AZ.6605 CM.5344 D5.6531 DM.A28K
34.6 34.6 34.6 34.7 34.7 34.7 34.9 34.9 34.9
D5.5538 AD.6889 G4.6294 G4.6625 A6.6652 A6.6653 D5.6539 G4.6299 AD.A5EJ
35.3 35.4 35.4 35.5 35.6 35.6 35.6 35.7 36.3
D5.6538 NH.A50V AZ.5403 CA.6719 CM.6171 DM.A28G G4.6626 A6.6137 D5.6540
36.3 36.3 36.4 36.4 36.5 36.6 36.6 36.7 36.8
G4.6297 D5.6931 A6.6140 A6.6138 AZ.4616 DM.A28E CM.6675 CM.6680 A6.2686
36.9 37.0 37.1 37.3 37.4 37.5 37.6 37.7 37.8
CM.6161 CM.4751 CM.6676 CK.6746 D5.5539 NH.A50T A6.5667 AM.5820 CM.5863
37.9 38.0 38.0 38.1 38.1 38.1 38.2 38.2 38.3
G4.6586 A6.5660 AY.6386 D5.6926 DM.A282 AY.A54L AZ.6601 A6.5664 CK.4947
38.3 38.4 38.4 38.5 38.5 38.6 38.6 38.8 38.8
AA.3526 CA.6716 A6.6654 A6.6782 DM.A0XD F4.6459 A6.5662 D5.6537 G4.6314
39.0 39.0 39.2 39.3 39.4 39.4 39.5 39.5 39.5
D5.6932 DM.A1D4 G4.6628 DM.A28A AA.3496 AU.6004 D5.6532 AA.3685 CM.6163
39.6 39.6 39.6 39.7 39.8 39.8 39.8 39.9 39.9
F4.6570 CM.6170 D5.6928 DM.A28C CM.5348 NH.A5IV AA.3511 AA.3712 F4.6809
39.9 40.0 40.0 40.1 40.2 40.2 40.3 40.4 40.4
AA.A02K AZ.6606 AZ.6608 CK.4952 QG.A5YV QG.A5YX AY.6197 D5.6529 CM.6168
40.6 40.6 40.6 40.6 40.6 40.7 40.8 40.9 41.0
DM.A28H CM.6162 CK.5914 AZ.6600 AA.3713 A6.6651 CA.5797 CM.6679 CA.6718
41.0 41.1 41.6 41.9 42.0 42.3 42.5 42.5 42.7
CM.6678 AD.6548 AA.A02Y AD.A5EK AA.3675 CK.5913 D5.7000 DM.A1HA CA.6715
42.7 42.8 43.2 43.2 43.3 43.3 43.3 43.7 44.1
D5.6922 QG.A5YW AY.A71X F4.6569 AZ.4682 AA.3489 D5.6898 CM.6164 DM.A1HB
44.2 44.3 44.4 44.5 44.6 45.2 45.2 45.3 45.9
NH.A6GA AY.5543 CM.5868 G4.6322 D5.5541 CK.6748 QG.A5Z2 DM.A285 AD.6888
46.1 46.5 46.6 46.6 46.7 46.8 46.8 47.0 47.2
F4.6854 G4.6303 DM.A28F AD.6899 CA.5254 CK.6747 DM.A1D7 DM.A28M A6.A565
47.3 47.9 48.0 48.2 48.2 48.3 48.4 48.5 48.6
QG.A5Z1 D5.6541 AY.A69D DM.A1D9 WS.AB45 DM.A0XF CM.4744 NH.A50U DM.A1DA
48.9 49.0 49.1 49.4 49.4 50.6 50.7 50.7 50.9
CK.4948 D5.6534 D5.6923 DM.A1D0 CK.4950 A6.6649 F4.6703 DM.A0X9 A6.5657
51.5 51.7 52.0 52.0 52.5 52.7 53.9 54.5 56.7
G4.6307 DM.A1D6 A6.6648 F4.6704
56.9 62.7 63.1 67.5
Figure 21: Library sizes in increasing order
Looking at the plot in figure 21, we can see two important details that we must take into consideration. The first one is that all four stages have different samples with different sequencing depth. The other one is that we observe that within the considerable differences in the sequencing depth of some samples, we can distinguish some samples on the left part of the graph that present considerably low sequencing depths: this may generate problems in further analysis and that is why we then wanted to remove these samples as shown below:
# Remove those samples with very low sequencing depth (<10)
coadse.tumor <- coadse.tumor[, dge.tumor$samples$lib.size >= 10*1e06]
dge.tumor <- DGEList(counts = assays(coadse.tumor)$counts, group = coadse.tumor$ajcc_pathologic_tumor_stage, genes = genes.tumor)
# coverage_filtered plot
par(mfrow = c(1, 1), mar = c(4, 5, 1, 1))
ord.tumor <- order(dge.tumor$sample$lib.size)
ggplot(as.data.frame(dge.tumor$samples), aes(x = row.names(dge.tumor$samples), y = dge.tumor$samples$lib.size[ord.tumor]/1000000, fill = dge.tumor$samples$group[ord.tumor])) + geom_bar(stat = "identity") + theme_bw() + theme(legend.title = element_blank(), axis.text.x=element_blank()) + ylab("Milions of reads") + xlab("Samples")
Figure 22: Library sizes in increasing order after the filtering
In this second plot(Figure 22) we can see the subset of data that we are going to use after having removed the samples with less coverage depth. In total,we have lost 21 samples from our dataset.
ord2 <- order(dge.tumor$sample$lib.size)
table(coadse.tumor$gender)
FEMALE MALE
194 212
barplot(dge.tumor$sample$lib.size[ord2]/1e+06, las = 1, ylab = "Millions of reads", xlab = "Samples",
col = c("cyan", "orange")[coadse.tumor$gender[ord2]], border = NA)
legend("topleft", c("Female", "Male"), fill = c("cyan", "orange"), inset = 0.01)
Figure 23: Library sizes with respect to gender
In the figure 23 we can see that, as happened with the paired analysis, not only we have a similar number of female and male samples, but their sequencing depth is quite well equilibrated as we cannot clearly identify any defined clustering event.
Then, we first run a within sample normalization and we also explore the distribution of the expression levels through all the samples in terms of logarithmic CPM units.
Figure 24: Non-parametric density distribution of expression profiles per sample
In figure 24 we can observe a non-parametric density distribution of expression profiles per sample which follows a bimodal distribution.
Figure 25: Non-parametric density distribution of expression profiles per sample
To provide a clearer visualization, we decided to divide the dataset into four smaller subsets, one for each stage. Figure 25 shows the expression levels in terms of logCPM for the different stages of tumor samples separately.
As we do not observe extreme differences in the distribution of the expression values in the different tumor stages, we decide to not exclude any sample from the dataset in this step.
We now want to look if there is any gene which has very low expression to exclude them from the dataset.
Figure 26: Distribution of average expression level per gene
In order to do so, we have calculated the average expression of each gene for all the samples and plotted their distribution in Figure 26.
Those genes that have very low counts should be removed prior to further analysis because a gene must be expressed at some minimal level before it is likely to be translated into a protein and also because genes statistically they are very unlikely to be assessed as significantly differential expressed. So, such genes can therefore be removed from the analysis without any loss of information.
Then, to filter the lowly-expressed genes we calculate first a minimum CPM cutoff value of expression; 10 reads per million in the sample with lowest depth. We will keep also only those genes that have this minimum level of expression in at least as many samples as the smallest group of comparison.
cpmcutoff.tumor <- round(10/min(dge.tumor$sample$lib.size/1e+06), digits = 1)
cpmcutoff.tumor
[1] 1
sprintf("Cutoff: %s", cpmcutoff.tumor)
[1] "Cutoff: 1"
nsamplescutoff.tumor <- min(table(coadse.tumor$ajcc_pathologic_tumor_stage))
nsamplescutoff.tumor
[1] 60
mask.tumor <- rowSums(cpm(dge.tumor) > cpmcutoff.tumor) >= nsamplescutoff.tumor
coadse.tumor.filt <- coadse.tumor[mask.tumor, ]
dge.tumor.filt <- dge.tumor[mask.tumor, ]
After the filtering of lowly-expressed genes, we have lost almost 6.500 genes.
dim(coadse.tumor)
[1] 20115 406
dim(coadse.tumor.filt)
[1] 13569 406
We can visually observe which genes have been left out from the datatset in Figure 7
par(mar = c(4, 5, 1, 1))
h <- hist(avgexp.tumor, xlab = expression("Expression level (" * log[2] * "CPM)"), main = "",
las = 1, col = "grey", cex.axis = 1.2, cex.lab = 1.5)
x <- cut(rowMeans(assays(coadse.tumor.filt)$logCPM), breaks = h$breaks)
lines(h$mids, table(x), type = "h", lwd = 10, lend = 1, col = "darkred")
legend("topright", c("All genes", "Filtered genes"), fill = c("grey", "darkred"))
Figure 27: Distribution of average expression level per gene after the filtering
We can visually observe which genes have been left out from the datatset in Figure 27.
To make gene expression values comparable across samples, normalization step is needed to continue with our analysis. We estimate a normalization factor for each library using the Trimmed Mean of M-Values and we apply it to our data.
dge.tumor.filt <- calcNormFactors(dge.tumor.filt)
assays(coadse.tumor.filt)$logCPM <- cpm(dge.tumor.filt, log=TRUE, normalized.lib.sizes=TRUE, prior.count=0.25)
saveRDS(coadse.tumor.filt, file.path("results", "coadse.tumor.filt.rds"))
saveRDS(dge.tumor.filt, file.path("results", "dge.tumor.filt.rds"))
We now want to visualize the expression profiles of the normalized data for each tumor stage.
For Stage I –>
Figure 28: MA-plots of the tumor samples in Stage I
In Figure 28 we observe the MA-plots for the Tumor samples of stage I. We can see that even after the between and within normalization steps, we still have some artifacts in our data as we can observe for example in 28 A, C, F, AH and AT.
For Stage II –>
Figure 29: MA-plots of the tumor samples in Stage II
In Figure 29 we observe the MA-plots for the Tumor samples of stage II. We can see that the majority of MA plots tend to follow the blue line with some exceptions like Figure 29 B, E, F, N, AO, DM, DR, EI and EX.
For Stage III –>
Figure 30: MA-plots of the tumor samples in Stage III
In Figure 30 we observe the MA-plots for the Tumor samples of stage III. With the samples of stage III we see a similar situation compared to the previous MA-plots; we can see that the red line of the majority of plots tends to follow the blue line with some exceptions like Figure 30 J, K, AV, BB, BC, CT and DN.
For Stage IV –>
Figure 31: MA-plots of the tumor samples in Stage IV
Finally, in Figure 31 we observe the MA-plots for the Tumor samples of stage IV. In this case, we observe again a similar situation; there are some plots were we can detect some artifacts, like for example J, M, V, AD, AI and AZ.
As normalization of the data can not always remove completely the batch effect, the next step will be indeed the search of potential surrogate of batch effect indicators derived from different elements of the TCGA barcode of each sample.
tss.tumor <- substr(colnames(dge.tumor.filt), 6, 7)
table(data.frame(STAGE=coadse.tumor.filt$ajcc_pathologic_tumor_stage, TSS=tss.tumor))
TSS
STAGE 3L 4N 4T A6 AA AD AM AU AY AZ CA CK CM D5 DM F4 G4 NH QG QL RU
Stage I 1 0 0 2 31 3 0 1 3 3 0 4 9 5 0 4 2 0 1 1 0
Stage II 0 0 1 12 63 2 1 1 2 5 8 4 12 13 15 6 9 2 1 0 0
Stage III 0 1 0 15 37 2 0 0 3 4 1 4 10 11 8 5 10 3 3 0 1
Stage IV 0 0 0 7 24 0 1 0 2 8 0 2 5 1 1 1 5 3 0 0 0
TSS
STAGE WS
Stage I 0
Stage II 1
Stage III 0
Stage IV 0
center.tumor <- substr(colnames(dge.tumor.filt), 27, 28)
table(data.frame(STAGE=coadse.tumor.filt$ajcc_pathologic_tumor_stage, CENTER=center.tumor))
CENTER
STAGE 07
Stage I 70
Stage II 158
Stage III 118
Stage IV 60
samplevial.tumor <- substr(colnames(dge.tumor.filt), 14, 16)
table(data.frame(STAGE=coadse.tumor.filt$ajcc_pathologic_tumor_stage, SAMPLEVIAL=samplevial.tumor))
SAMPLEVIAL
STAGE 01A 01B
Stage I 70 0
Stage II 158 0
Stage III 117 1
Stage IV 58 2
portionanalyte.tumor <- substr(colnames(dge.tumor.filt), 18, 20)
table(data.frame(STAGE=coadse.tumor.filt$ajcc_pathologic_tumor_stage, PORTIONANALYTE=portionanalyte.tumor))
PORTIONANALYTE
STAGE 01R 02R 03R 11R 12R 21R 23R 31R 33R 42R 43R 72R
Stage I 28 6 0 31 2 1 1 0 0 0 1 0
Stage II 68 12 0 59 4 14 0 0 0 1 0 0
Stage III 33 14 0 51 0 18 0 2 0 0 0 0
Stage IV 31 2 1 18 1 3 0 2 1 0 0 1
plate.tumor <- substr(colnames(dge.tumor.filt), 22, 25)
table(data.frame(STAGE=coadse.tumor.filt$ajcc_pathologic_tumor_stage, PLATE=plate.tumor))
PLATE
STAGE 0821 0826 0905 1022 1113 1410 1653 1723 1774 1839 1873 1928 A002
Stage I 5 1 3 10 1 4 10 5 4 6 1 2 3
Stage II 10 3 12 15 1 11 13 14 13 16 0 12 6
Stage III 6 2 5 9 2 5 9 22 6 12 2 5 5
Stage IV 1 0 7 9 0 6 6 5 6 8 0 1 0
PLATE
STAGE A00A A083 A089 A155 A16W A180 A28H A32Y A32Z A37K A41B
Stage I 1 1 1 0 1 0 3 1 4 2 1
Stage II 5 1 0 8 6 0 3 3 2 2 2
Stage III 1 4 1 3 3 1 8 2 1 4 0
Stage IV 2 0 1 0 1 0 1 1 1 1 3
Too look for potential surrogates of batch effect indicators we need the maximum information explained by the possible batch generator, that means no zeros along the different groups. In order to be able to identify a possible batch effect and looking the previous tables, we decided to continue our analysis focusing in ‘plate’ information. Then we decided to see graphically how batch might be confounding the outcome of interest by doing a hierarchical clustering.
logCPM <- cpm(dge.tumor.filt, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(plate.tumor))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(coadse.tumor.filt)
outcome <- paste(substr(colnames(coadse.tumor.filt), 9, 12), as.character(coadse.tumor.filt$ajcc_pathologic_tumor_stage), sep="-")
names(outcome) <- colnames(coadse.tumor.filt)
sampleDendrogram <- dendrapply(sampleDendrogram,
function(x, batch, labels) {
if (is.leaf(x)) {
attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
attr(x, "label") <- as.vector(labels[attr(x, "label")])
}
x
}, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(plate.tumor))), cex=0.75, fill=sort(unique(batch)))
Figure 10: Hierarchical clustering of the samples
As a result of this graph in figure 10, we concluded that the hierarchical clustering can become quite useless as a graphical tool when you have a huge quantity of samples, so then we decided to change our approach and try to identify graphically the batch effect with a MDS plot.
col.stage <- c("blue","cyan", "orange", "purple")[factor(coadse.tumor.filt$ajcc_pathologic_tumor_stage)]
batch <- as.integer(factor(plate.tumor))
plotMDS(dge.tumor.filt, dim=c(3,4), col=col.stage, pch= batch, cex=2)
legend("topleft", legend=levels(factor(coadse.tumor.filt$ajcc_pathologic_tumor_stage)), col=col.stage ,pch=16, cex= 0.75)
legend("topright", legend=levels(factor(plate.tumor)),pch= batch, cex=0.75, ncol = 2)
Figure 32: MDS plot
With the MDSplot in figure 32 we are still not able to recognize any cluster of any group, so we can not conclude that ‘plate’ information could be a potential surrogate of batch effect.
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] it_IT.UTF-8/it_IT.UTF-8/it_IT.UTF-8/C/it_IT.UTF-8/it_IT.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] geneplotter_1.60.0 annotate_1.60.1
[3] XML_3.98-1.19 AnnotationDbi_1.44.0
[5] lattice_0.20-38 ggplot2_3.1.1
[7] edgeR_3.24.3 limma_3.38.3
[9] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
[11] BiocParallel_1.16.6 matrixStats_0.54.0
[13] Biobase_2.42.0 GenomicRanges_1.34.0
[15] GenomeInfoDb_1.18.2 IRanges_2.16.0
[17] S4Vectors_0.20.1 BiocGenerics_0.28.0
[19] knitr_1.22 BiocStyle_2.10.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 locfit_1.5-9.1 assertthat_0.2.1
[4] digest_0.6.18 R6_2.4.0 plyr_1.8.4
[7] RSQLite_2.1.1 evaluate_0.13 highr_0.8
[10] pillar_1.3.1 zlibbioc_1.28.0 rlang_0.3.4
[13] lazyeval_0.2.2 blob_1.1.1 Matrix_1.2-17
[16] rmarkdown_1.12 labeling_0.3 stringr_1.4.0
[19] RCurl_1.95-4.12 bit_1.1-14 munsell_0.5.0
[22] compiler_3.5.3 xfun_0.6 pkgconfig_2.0.2
[25] htmltools_0.3.6 tidyselect_0.2.5 tibble_2.1.1
[28] GenomeInfoDbData_1.2.0 bookdown_0.9 codetools_0.2-16
[31] crayon_1.3.4 dplyr_0.8.1 withr_2.1.2
[34] bitops_1.0-6 grid_3.5.3 xtable_1.8-4
[37] gtable_0.3.0 DBI_1.0.0 magrittr_1.5
[40] scales_1.0.0 KernSmooth_2.23-15 stringi_1.4.3
[43] XVector_0.22.0 RColorBrewer_1.1-2 tools_3.5.3
[46] bit64_0.9-7 glue_1.3.1 purrr_0.3.2
[49] yaml_2.2.0 colorspace_1.4-1 BiocManager_1.30.4
[52] memoise_1.1.0
After having filtered and normalized our tumor dataset, we looked for differential expression of genes among the cancer stages. We applied a limma voom pipeline, the steps of which are explained in depth already in the paired DEanalysis section and for this reason here we only focus on the steps that differ in the new one. Since the tumor stage has 4 different values (“StageI”,“StageII”,“StageIII”, “StageIV”“), we decide to use a model with no intercept and a contrast matrix instead, following the instruction of the”Several groups" section of the limma user guide. After having obtained the model with the help of the contrast matrix, we applied eBayes. After multiple test adjusting, we are left with no DE genes.
coadse.tumor$ajcc_pathologic_tumor_stage <- gsub(x = coadse.tumor$ajcc_pathologic_tumor_stage, pattern = " ", replacement = "")
coadse.tumor$ajcc_pathologic_tumor_stage <- droplevels(as.factor(coadse.tumor$ajcc_pathologic_tumor_stage))
coadse.tumor$age_at_initial_pathologic_diagnosis <- droplevels(coadse.tumor$age_at_initial_pathologic_diagnosis)
lev <- c("StageI","StageII","StageIII", "StageIV")
f <- factor(coadse.tumor$ajcc_pathologic_tumor_stage, levels=lev)
design <- model.matrix(~0+f , colData(coadse.tumor))
colnames(design) <- lev
fit1 <- lmFit(assays(coadse.tumor)$logCPM,design)
contrast.matrix <- makeContrasts(StageI-StageII, StageI-StageIII, StageI-StageIV, StageII-StageIII, StageII-StageIV, StageIII-StageIV, levels=design)
fit2 <- contrasts.fit(fit1, contrast.matrix)
fit2 <- eBayes(fit2)
tt<-topTable(fit2, coef=1, n=Inf,adjust="fdr")
DEgenes <- rownames(tt)[tt$adj.P.Val < 0.1]
length(DEgenes)
[1] 0
saveRDS(tt, file.path("results", "tt.rds"))
This can be due to the fact that in many different biological contexts gene expression changes may be very little and after Multiple test adjusting no significant DE genes can be identified. We then decide to run a Principal Component Analysis to further explore the gene expression among the four cancer stages. Since we use the prcomp() function, we first need to transpose our matrix in order to have the samples as rows, if not we would get a pca analysis explaining how genes are related to each other, which in this case is not the desired output. After having prepared our data, we run the pca analyis and we can observe that the first four principal components explain respectively 11.7%, 7%, 6% and 4% of the variability.
a<-assays(coadse.tumor)
ta<-t(assays(coadse.tumor)$logCPM)
pca<- prcomp(ta)
summary(pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 44.6935 36.51157 32.42333 28.17916 25.07691 20.6460
Proportion of Variance 0.1171 0.07818 0.06165 0.04657 0.03688 0.0250
Cumulative Proportion 0.1171 0.19531 0.25696 0.30353 0.34040 0.3654
PC7 PC8 PC9 PC10 PC11 PC12
Standard deviation 18.59050 17.71089 17.08951 16.53713 16.14581 14.85982
Proportion of Variance 0.02027 0.01839 0.01713 0.01604 0.01529 0.01295
Cumulative Proportion 0.38567 0.40406 0.42119 0.43723 0.45251 0.46546
PC13 PC14 PC15 PC16 PC17 PC18
Standard deviation 14.48627 13.96364 13.04648 12.84439 12.54125 12.17461
Proportion of Variance 0.01231 0.01143 0.00998 0.00967 0.00922 0.00869
Cumulative Proportion 0.47777 0.48920 0.49918 0.50886 0.51808 0.52677
PC19 PC20 PC21 PC22 PC23 PC24
Standard deviation 11.98712 11.63894 11.46548 10.98483 10.8444 10.6052
Proportion of Variance 0.00843 0.00794 0.00771 0.00708 0.0069 0.0066
Cumulative Proportion 0.53520 0.54314 0.55085 0.55793 0.5648 0.5714
PC25 PC26 PC27 PC28 PC29 PC30
Standard deviation 10.16535 10.06506 10.04708 9.83001 9.43738 9.30215
Proportion of Variance 0.00606 0.00594 0.00592 0.00567 0.00522 0.00507
Cumulative Proportion 0.57748 0.58342 0.58934 0.59501 0.60023 0.60530
PC31 PC32 PC33 PC34 PC35 PC36
Standard deviation 9.03161 8.98159 8.8583 8.70738 8.38520 8.26667
Proportion of Variance 0.00478 0.00473 0.0046 0.00445 0.00412 0.00401
Cumulative Proportion 0.61009 0.61482 0.6194 0.62387 0.62799 0.63200
PC37 PC38 PC39 PC40 PC41 PC42
Standard deviation 8.22396 8.1527 8.01166 7.95894 7.90294 7.71894
Proportion of Variance 0.00397 0.0039 0.00376 0.00371 0.00366 0.00349
Cumulative Proportion 0.63596 0.6399 0.64363 0.64734 0.65100 0.65450
PC43 PC44 PC45 PC46 PC47 PC48
Standard deviation 7.62120 7.48398 7.43333 7.37608 7.31295 7.25205
Proportion of Variance 0.00341 0.00328 0.00324 0.00319 0.00314 0.00308
Cumulative Proportion 0.65790 0.66119 0.66443 0.66762 0.67075 0.67384
PC49 PC50 PC51 PC52 PC53 PC54
Standard deviation 7.19878 7.13457 7.05670 6.99305 6.92780 6.85429
Proportion of Variance 0.00304 0.00298 0.00292 0.00287 0.00281 0.00276
Cumulative Proportion 0.67688 0.67986 0.68278 0.68565 0.68846 0.69122
PC55 PC56 PC57 PC58 PC59 PC60 PC61
Standard deviation 6.82387 6.7797 6.74418 6.66551 6.61843 6.58317 6.5330
Proportion of Variance 0.00273 0.0027 0.00267 0.00261 0.00257 0.00254 0.0025
Cumulative Proportion 0.69395 0.6966 0.69931 0.70192 0.70449 0.70703 0.7095
PC62 PC63 PC64 PC65 PC66 PC67
Standard deviation 6.46915 6.42760 6.42152 6.35342 6.29894 6.2584
Proportion of Variance 0.00245 0.00242 0.00242 0.00237 0.00233 0.0023
Cumulative Proportion 0.71199 0.71441 0.71683 0.71919 0.72152 0.7238
PC68 PC69 PC70 PC71 PC72 PC73
Standard deviation 6.21788 6.14443 6.1309 6.05797 6.02530 6.00753
Proportion of Variance 0.00227 0.00221 0.0022 0.00215 0.00213 0.00212
Cumulative Proportion 0.72608 0.72830 0.7305 0.73265 0.73478 0.73690
PC74 PC75 PC76 PC77 PC78 PC79
Standard deviation 5.95645 5.93414 5.91833 5.88620 5.81640 5.79733
Proportion of Variance 0.00208 0.00207 0.00205 0.00203 0.00198 0.00197
Cumulative Proportion 0.73898 0.74105 0.74310 0.74513 0.74712 0.74909
PC80 PC81 PC82 PC83 PC84 PC85
Standard deviation 5.77299 5.74475 5.72211 5.6907 5.64574 5.62126
Proportion of Variance 0.00195 0.00194 0.00192 0.0019 0.00187 0.00185
Cumulative Proportion 0.75104 0.75298 0.75490 0.7568 0.75866 0.76052
PC86 PC87 PC88 PC89 PC90 PC91
Standard deviation 5.60174 5.57095 5.54936 5.47291 5.46457 5.44289
Proportion of Variance 0.00184 0.00182 0.00181 0.00176 0.00175 0.00174
Cumulative Proportion 0.76236 0.76418 0.76598 0.76774 0.76949 0.77123
PC92 PC93 PC94 PC95 PC96 PC97
Standard deviation 5.43243 5.3871 5.36085 5.29807 5.29398 5.28114
Proportion of Variance 0.00173 0.0017 0.00169 0.00165 0.00164 0.00164
Cumulative Proportion 0.77296 0.7747 0.77635 0.77799 0.77964 0.78127
PC98 PC99 PC100 PC101 PC102 PC103
Standard deviation 5.23626 5.23255 5.20496 5.19151 5.18347 5.16841
Proportion of Variance 0.00161 0.00161 0.00159 0.00158 0.00158 0.00157
Cumulative Proportion 0.78288 0.78448 0.78607 0.78765 0.78923 0.79080
PC104 PC105 PC106 PC107 PC108 PC109
Standard deviation 5.15305 5.12863 5.12108 5.07225 5.06783 5.04423
Proportion of Variance 0.00156 0.00154 0.00154 0.00151 0.00151 0.00149
Cumulative Proportion 0.79235 0.79390 0.79543 0.79694 0.79845 0.79994
PC110 PC111 PC112 PC113 PC114 PC115
Standard deviation 5.01331 4.99007 4.98494 4.94455 4.92491 4.90752
Proportion of Variance 0.00147 0.00146 0.00146 0.00143 0.00142 0.00141
Cumulative Proportion 0.80141 0.80287 0.80433 0.80577 0.80719 0.80860
PC116 PC117 PC118 PC119 PC120 PC121 PC122
Standard deviation 4.8891 4.8850 4.86350 4.82485 4.82178 4.81539 4.78333
Proportion of Variance 0.0014 0.0014 0.00139 0.00137 0.00136 0.00136 0.00134
Cumulative Proportion 0.8100 0.8114 0.81279 0.81415 0.81552 0.81688 0.81822
PC123 PC124 PC125 PC126 PC127 PC128
Standard deviation 4.78080 4.76804 4.74201 4.74120 4.68750 4.68386
Proportion of Variance 0.00134 0.00133 0.00132 0.00132 0.00129 0.00129
Cumulative Proportion 0.81956 0.82089 0.82221 0.82353 0.82482 0.82610
PC129 PC130 PC131 PC132 PC133 PC134
Standard deviation 4.67484 4.66644 4.65354 4.60859 4.60377 4.59877
Proportion of Variance 0.00128 0.00128 0.00127 0.00125 0.00124 0.00124
Cumulative Proportion 0.82739 0.82866 0.82993 0.83118 0.83242 0.83366
PC135 PC136 PC137 PC138 PC139 PC140 PC141
Standard deviation 4.57334 4.56747 4.55318 4.5223 4.5179 4.49251 4.47413
Proportion of Variance 0.00123 0.00122 0.00122 0.0012 0.0012 0.00118 0.00117
Cumulative Proportion 0.83489 0.83611 0.83733 0.8385 0.8397 0.84091 0.84208
PC142 PC143 PC144 PC145 PC146 PC147
Standard deviation 4.45544 4.44353 4.43244 4.42651 4.41182 4.39015
Proportion of Variance 0.00116 0.00116 0.00115 0.00115 0.00114 0.00113
Cumulative Proportion 0.84324 0.84440 0.84555 0.84670 0.84784 0.84897
PC148 PC149 PC150 PC151 PC152 PC153 PC154
Standard deviation 4.37981 4.37028 4.35685 4.34626 4.3358 4.3228 4.30286
Proportion of Variance 0.00112 0.00112 0.00111 0.00111 0.0011 0.0011 0.00109
Cumulative Proportion 0.85010 0.85122 0.85233 0.85344 0.8545 0.8556 0.85672
PC155 PC156 PC157 PC158 PC159 PC160
Standard deviation 4.27687 4.27060 4.25577 4.25117 4.23543 4.22520
Proportion of Variance 0.00107 0.00107 0.00106 0.00106 0.00105 0.00105
Cumulative Proportion 0.85780 0.85887 0.85993 0.86099 0.86204 0.86309
PC161 PC162 PC163 PC164 PC165 PC166
Standard deviation 4.21232 4.18941 4.17341 4.16563 4.15160 4.14360
Proportion of Variance 0.00104 0.00103 0.00102 0.00102 0.00101 0.00101
Cumulative Proportion 0.86413 0.86516 0.86618 0.86720 0.86821 0.86921
PC167 PC168 PC169 PC170 PC171 PC172
Standard deviation 4.1327 4.10695 4.09876 4.07790 4.07162 4.06287
Proportion of Variance 0.0010 0.00099 0.00099 0.00098 0.00097 0.00097
Cumulative Proportion 0.8702 0.87120 0.87219 0.87316 0.87414 0.87510
PC173 PC174 PC175 PC176 PC177 PC178
Standard deviation 4.05339 4.04455 4.01090 4.00330 3.99674 3.98506
Proportion of Variance 0.00096 0.00096 0.00094 0.00094 0.00094 0.00093
Cumulative Proportion 0.87607 0.87703 0.87797 0.87891 0.87985 0.88078
PC179 PC180 PC181 PC182 PC183 PC184 PC185
Standard deviation 3.95933 3.95520 3.94865 3.93825 3.93104 3.9245 3.9157
Proportion of Variance 0.00092 0.00092 0.00091 0.00091 0.00091 0.0009 0.0009
Cumulative Proportion 0.88170 0.88262 0.88353 0.88444 0.88535 0.8862 0.8871
PC186 PC187 PC188 PC189 PC190 PC191
Standard deviation 3.89855 3.88968 3.88385 3.87187 3.86025 3.84428
Proportion of Variance 0.00089 0.00089 0.00088 0.00088 0.00087 0.00087
Cumulative Proportion 0.88804 0.88893 0.88981 0.89069 0.89156 0.89243
PC192 PC193 PC194 PC195 PC196 PC197
Standard deviation 3.83863 3.82117 3.81845 3.79988 3.79161 3.77460
Proportion of Variance 0.00086 0.00086 0.00086 0.00085 0.00084 0.00084
Cumulative Proportion 0.89329 0.89415 0.89501 0.89585 0.89670 0.89753
PC198 PC199 PC200 PC201 PC202 PC203
Standard deviation 3.76647 3.75734 3.74509 3.73608 3.73207 3.71334
Proportion of Variance 0.00083 0.00083 0.00082 0.00082 0.00082 0.00081
Cumulative Proportion 0.89836 0.89919 0.90001 0.90083 0.90165 0.90246
PC204 PC205 PC206 PC207 PC208 PC209 PC210
Standard deviation 3.70693 3.6999 3.6907 3.67838 3.67069 3.64806 3.63786
Proportion of Variance 0.00081 0.0008 0.0008 0.00079 0.00079 0.00078 0.00078
Cumulative Proportion 0.90326 0.9041 0.9049 0.90566 0.90645 0.90723 0.90800
PC211 PC212 PC213 PC214 PC215 PC216
Standard deviation 3.63668 3.62390 3.61442 3.60922 3.60190 3.58148
Proportion of Variance 0.00078 0.00077 0.00077 0.00076 0.00076 0.00075
Cumulative Proportion 0.90878 0.90955 0.91032 0.91108 0.91184 0.91259
PC217 PC218 PC219 PC220 PC221 PC222
Standard deviation 3.56900 3.56600 3.55709 3.55361 3.52974 3.51467
Proportion of Variance 0.00075 0.00075 0.00074 0.00074 0.00073 0.00072
Cumulative Proportion 0.91334 0.91409 0.91483 0.91557 0.91630 0.91702
PC223 PC224 PC225 PC226 PC227 PC228 PC229
Standard deviation 3.50518 3.49158 3.48292 3.47675 3.47090 3.4623 3.4546
Proportion of Variance 0.00072 0.00071 0.00071 0.00071 0.00071 0.0007 0.0007
Cumulative Proportion 0.91774 0.91846 0.91917 0.91988 0.92059 0.9213 0.9220
PC230 PC231 PC232 PC233 PC234 PC235
Standard deviation 3.42926 3.41371 3.40741 3.40331 3.39621 3.38637
Proportion of Variance 0.00069 0.00068 0.00068 0.00068 0.00068 0.00067
Cumulative Proportion 0.92268 0.92336 0.92404 0.92472 0.92540 0.92607
PC236 PC237 PC238 PC239 PC240 PC241
Standard deviation 3.38031 3.37654 3.37250 3.36566 3.34443 3.33955
Proportion of Variance 0.00067 0.00067 0.00067 0.00066 0.00066 0.00065
Cumulative Proportion 0.92674 0.92741 0.92808 0.92874 0.92940 0.93005
PC242 PC243 PC244 PC245 PC246 PC247
Standard deviation 3.33588 3.32950 3.31844 3.31419 3.30190 3.29236
Proportion of Variance 0.00065 0.00065 0.00065 0.00064 0.00064 0.00064
Cumulative Proportion 0.93070 0.93135 0.93200 0.93264 0.93328 0.93392
PC248 PC249 PC250 PC251 PC252 PC253
Standard deviation 3.28535 3.28141 3.26743 3.26422 3.25422 3.24564
Proportion of Variance 0.00063 0.00063 0.00063 0.00062 0.00062 0.00062
Cumulative Proportion 0.93455 0.93518 0.93581 0.93643 0.93705 0.93767
PC254 PC255 PC256 PC257 PC258 PC259 PC260
Standard deviation 3.23760 3.22960 3.2105 3.1997 3.1901 3.18130 3.17704
Proportion of Variance 0.00061 0.00061 0.0006 0.0006 0.0006 0.00059 0.00059
Cumulative Proportion 0.93829 0.93890 0.9395 0.9401 0.9407 0.94129 0.94189
PC261 PC262 PC263 PC264 PC265 PC266
Standard deviation 3.17478 3.16568 3.16051 3.15364 3.14908 3.12888
Proportion of Variance 0.00059 0.00059 0.00059 0.00058 0.00058 0.00057
Cumulative Proportion 0.94248 0.94306 0.94365 0.94423 0.94481 0.94539
PC267 PC268 PC269 PC270 PC271 PC272
Standard deviation 3.11436 3.11310 3.10912 3.10374 3.09464 3.08761
Proportion of Variance 0.00057 0.00057 0.00057 0.00056 0.00056 0.00056
Cumulative Proportion 0.94596 0.94653 0.94709 0.94766 0.94822 0.94878
PC273 PC274 PC275 PC276 PC277 PC278
Standard deviation 3.07420 3.07246 3.06774 3.04938 3.04524 3.03002
Proportion of Variance 0.00055 0.00055 0.00055 0.00055 0.00054 0.00054
Cumulative Proportion 0.94933 0.94989 0.95044 0.95098 0.95153 0.95207
PC279 PC280 PC281 PC282 PC283 PC284
Standard deviation 3.02888 3.02651 3.01281 3.00031 2.99465 2.98222
Proportion of Variance 0.00054 0.00054 0.00053 0.00053 0.00053 0.00052
Cumulative Proportion 0.95260 0.95314 0.95367 0.95420 0.95473 0.95525
PC285 PC286 PC287 PC288 PC289 PC290 PC291
Standard deviation 2.97629 2.97135 2.96061 2.95295 2.93863 2.9325 2.9254
Proportion of Variance 0.00052 0.00052 0.00051 0.00051 0.00051 0.0005 0.0005
Cumulative Proportion 0.95577 0.95629 0.95680 0.95731 0.95782 0.9583 0.9588
PC292 PC293 PC294 PC295 PC296 PC297
Standard deviation 2.9065 2.90285 2.89613 2.88550 2.88129 2.87175
Proportion of Variance 0.0005 0.00049 0.00049 0.00049 0.00049 0.00048
Cumulative Proportion 0.9593 0.95981 0.96030 0.96079 0.96128 0.96176
PC298 PC299 PC300 PC301 PC302 PC303
Standard deviation 2.86229 2.85615 2.84741 2.83469 2.82846 2.82636
Proportion of Variance 0.00048 0.00048 0.00048 0.00047 0.00047 0.00047
Cumulative Proportion 0.96224 0.96272 0.96320 0.96367 0.96414 0.96461
PC304 PC305 PC306 PC307 PC308 PC309
Standard deviation 2.81953 2.81194 2.80863 2.79527 2.79464 2.77689
Proportion of Variance 0.00047 0.00046 0.00046 0.00046 0.00046 0.00045
Cumulative Proportion 0.96507 0.96554 0.96600 0.96646 0.96692 0.96737
PC310 PC311 PC312 PC313 PC314 PC315
Standard deviation 2.77457 2.76934 2.75810 2.75248 2.74274 2.73410
Proportion of Variance 0.00045 0.00045 0.00045 0.00044 0.00044 0.00044
Cumulative Proportion 0.96782 0.96827 0.96871 0.96916 0.96960 0.97004
PC316 PC317 PC318 PC319 PC320 PC321
Standard deviation 2.72543 2.72232 2.71713 2.70336 2.69912 2.68869
Proportion of Variance 0.00044 0.00043 0.00043 0.00043 0.00043 0.00042
Cumulative Proportion 0.97047 0.97091 0.97134 0.97177 0.97220 0.97262
PC322 PC323 PC324 PC325 PC326 PC327
Standard deviation 2.68677 2.67612 2.67131 2.66771 2.65190 2.63964
Proportion of Variance 0.00042 0.00042 0.00042 0.00042 0.00041 0.00041
Cumulative Proportion 0.97304 0.97346 0.97388 0.97430 0.97471 0.97512
PC328 PC329 PC330 PC331 PC332 PC333 PC334
Standard deviation 2.62973 2.62841 2.6241 2.6181 2.6052 2.5954 2.59464
Proportion of Variance 0.00041 0.00041 0.0004 0.0004 0.0004 0.0004 0.00039
Cumulative Proportion 0.97553 0.97593 0.9763 0.9767 0.9771 0.9775 0.97793
PC335 PC336 PC337 PC338 PC339 PC340
Standard deviation 2.58933 2.57766 2.57635 2.56835 2.56263 2.54871
Proportion of Variance 0.00039 0.00039 0.00039 0.00039 0.00039 0.00038
Cumulative Proportion 0.97832 0.97871 0.97910 0.97948 0.97987 0.98025
PC341 PC342 PC343 PC344 PC345 PC346
Standard deviation 2.54729 2.53286 2.52395 2.52075 2.51463 2.50988
Proportion of Variance 0.00038 0.00038 0.00037 0.00037 0.00037 0.00037
Cumulative Proportion 0.98063 0.98101 0.98138 0.98175 0.98212 0.98249
PC347 PC348 PC349 PC350 PC351 PC352
Standard deviation 2.48787 2.47835 2.47456 2.46291 2.45721 2.45083
Proportion of Variance 0.00036 0.00036 0.00036 0.00036 0.00035 0.00035
Cumulative Proportion 0.98286 0.98322 0.98358 0.98393 0.98429 0.98464
PC353 PC354 PC355 PC356 PC357 PC358
Standard deviation 2.44010 2.43357 2.42909 2.42339 2.41580 2.41093
Proportion of Variance 0.00035 0.00035 0.00035 0.00034 0.00034 0.00034
Cumulative Proportion 0.98499 0.98533 0.98568 0.98602 0.98637 0.98671
PC359 PC360 PC361 PC362 PC363 PC364
Standard deviation 2.39708 2.39219 2.38403 2.37501 2.36474 2.36072
Proportion of Variance 0.00034 0.00034 0.00033 0.00033 0.00033 0.00033
Cumulative Proportion 0.98704 0.98738 0.98771 0.98804 0.98837 0.98870
PC365 PC366 PC367 PC368 PC369 PC370
Standard deviation 2.34682 2.34017 2.33101 2.32409 2.32219 2.31027
Proportion of Variance 0.00032 0.00032 0.00032 0.00032 0.00032 0.00031
Cumulative Proportion 0.98902 0.98934 0.98966 0.98998 0.99030 0.99061
PC371 PC372 PC373 PC374 PC375 PC376 PC377
Standard deviation 2.30664 2.29914 2.28848 2.2800 2.2738 2.2611 2.2561
Proportion of Variance 0.00031 0.00031 0.00031 0.0003 0.0003 0.0003 0.0003
Cumulative Proportion 0.99092 0.99123 0.99154 0.9918 0.9921 0.9925 0.9927
PC378 PC379 PC380 PC381 PC382 PC383
Standard deviation 2.2495 2.23862 2.23139 2.22381 2.20932 2.19898
Proportion of Variance 0.0003 0.00029 0.00029 0.00029 0.00029 0.00028
Cumulative Proportion 0.9930 0.99333 0.99363 0.99392 0.99420 0.99449
PC384 PC385 PC386 PC387 PC388 PC389
Standard deviation 2.19738 2.19201 2.17961 2.16815 2.16281 2.15704
Proportion of Variance 0.00028 0.00028 0.00028 0.00028 0.00027 0.00027
Cumulative Proportion 0.99477 0.99505 0.99533 0.99561 0.99588 0.99615
PC390 PC391 PC392 PC393 PC394 PC395
Standard deviation 2.13686 2.13183 2.11173 2.09438 2.08236 2.07396
Proportion of Variance 0.00027 0.00027 0.00026 0.00026 0.00025 0.00025
Cumulative Proportion 0.99642 0.99669 0.99695 0.99721 0.99746 0.99771
PC396 PC397 PC398 PC399 PC400 PC401
Standard deviation 2.03804 2.03039 2.01921 2.01246 2.01009 1.98283
Proportion of Variance 0.00024 0.00024 0.00024 0.00024 0.00024 0.00023
Cumulative Proportion 0.99796 0.99820 0.99844 0.99867 0.99891 0.99914
PC402 PC403 PC404 PC405 PC406
Standard deviation 1.95088 1.91258 1.90654 1.88289 3.267e-14
Proportion of Variance 0.00022 0.00021 0.00021 0.00021 0.000e+00
Cumulative Proportion 0.99936 0.99958 0.99979 1.00000 1.000e+00
dtp <- data.frame('Stages' = coadse.tumor$ajcc_pathologic_tumor_stage, pca$x[,1:2]) # the first two componets are selected (NB: you can also select 3 for 3D plottings or 3+)
ggplot(data = dtp) + geom_point(aes(x = PC1, y = PC2, col = Stages)) + theme_minimal()
Figure 33: PC 1 and PC 2 by cancer Stage
In Figure 33 it seems that there is no differential expression among the stages. However, we decide to proceed with the GSE Analysis. In fact, there may be small but consistent changes in gene expression values which could still imply relevant changes between the different stages.
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] it_IT.UTF-8/it_IT.UTF-8/it_IT.UTF-8/C/it_IT.UTF-8/it_IT.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GSVA_1.30.0 GSVAdata_1.18.0
[3] hgu95a.db_3.2.3 GSEABase_1.44.0
[5] org.Hs.eg.db_3.7.0 xtable_1.8-4
[7] GOstats_2.48.0 graph_1.60.0
[9] Category_2.48.1 Matrix_1.2-17
[11] ggplot2_3.1.1 calibrate_1.7.2
[13] MASS_7.3-51.4 geneplotter_1.60.0
[15] annotate_1.60.1 XML_3.98-1.19
[17] AnnotationDbi_1.44.0 lattice_0.20-38
[19] edgeR_3.24.3 limma_3.38.3
[21] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
[23] BiocParallel_1.16.6 matrixStats_0.54.0
[25] Biobase_2.42.0 GenomicRanges_1.34.0
[27] GenomeInfoDb_1.18.2 IRanges_2.16.0
[29] S4Vectors_0.20.1 BiocGenerics_0.28.0
[31] knitr_1.22 BiocStyle_2.10.0
loaded via a namespace (and not attached):
[1] bitops_1.0-6 bit64_0.9-7 RColorBrewer_1.1-2
[4] Rgraphviz_2.26.0 tools_3.5.3 R6_2.4.0
[7] DBI_1.0.0 lazyeval_0.2.2 colorspace_1.4-1
[10] withr_2.1.2 tidyselect_0.2.5 bit_1.1-14
[13] compiler_3.5.3 labeling_0.3 bookdown_0.9
[16] scales_1.0.0 genefilter_1.64.0 RBGL_1.58.2
[19] stringr_1.4.0 digest_0.6.18 rmarkdown_1.12
[22] AnnotationForge_1.24.0 XVector_0.22.0 pkgconfig_2.0.2
[25] htmltools_0.3.6 highr_0.8 rlang_0.3.4
[28] RSQLite_2.1.1 shiny_1.3.2 dplyr_0.8.1
[31] RCurl_1.95-4.12 magrittr_1.5 GO.db_3.7.0
[34] GenomeInfoDbData_1.2.0 Rcpp_1.0.1 munsell_0.5.0
[37] stringi_1.4.3 yaml_2.2.0 zlibbioc_1.28.0
[40] plyr_1.8.4 grid_3.5.3 blob_1.1.1
[43] promises_1.0.1 crayon_1.3.4 splines_3.5.3
[46] locfit_1.5-9.1 pillar_1.3.1 codetools_0.2-16
[49] glue_1.3.1 evaluate_0.13 BiocManager_1.30.4
[52] httpuv_1.5.1 gtable_0.3.0 purrr_0.3.2
[55] assertthat_0.2.1 xfun_0.6 mime_0.6
[58] later_0.8.0 survival_2.44-1.1 tibble_2.1.1
[61] shinythemes_1.1.2 memoise_1.1.0
Here we ran the GSEA analysis for the tumor dataset. Same as for the DE Analysis, we decided not to explain again every step of the pipeline, which in this case were all explained in detail in the FE Analysis section of the paired dataset.
Same as for the paired dataset, we decided to restrict our analysis to the KEGG, REACTOME and BIOCARTA genesets. We then built the incidence matrix and we only selected the genesets which are present in both our dataset and the incidence matrix. Furthermore, we filtered out the genesets that contain less than 5 genes.
After that, we calculated the z-scores for the genesets as well as the adjusted p-values and focus on the ones with lower p-values.
# collect gene sets
data(c2BroadSets)
gsc <- GeneSetCollection(c2BroadSets)
gsc <- gsc[c(grep("^KEGG", names(gsc)),grep("^REACTOME", names(gsc)), grep("^BIOCARTA", names(gsc)))]
gsc <- mapIdentifiers(gsc, AnnoOrEntrezIdentifier(metadata(coadse.tumor)$annotation))
# Know which gene is in which geneset
Im <- incidence(gsc)
# Discard genes that are not in our data
Im <- Im[, colnames(Im) %in% rownames(coadse.tumor)]
# Discard genes that are not in the dataset
coadse.tumor <- coadse.tumor[colnames(Im), ]
dge.tumor <- dge.tumor[colnames(Im),]
# Only select genesets of decent size
Im <- Im[rowSums(Im) >= 5, ]
# Z SCORES
# calc moderated test statistics
tGSgenes <- tt[match(colnames(Im), rownames(tt)), "t"]
# calc the z test statistic
zS <- sqrt(rowSums(Im)) * (as.vector(Im %*% tGSgenes)/rowSums(Im))
#Look at the first few geneset
rnkGS <- sort(abs(zS), decreasing = TRUE)
pv <- pmin(pnorm(zS), 1 - pnorm(zS))
sum(pv < 0.5)
[1] 812
pvadj <- p.adjust(pv, method = "fdr")
DEgs <- names(pvadj)[which(pvadj < 0.1)]
DEgs
[1] "KEGG_OXIDATIVE_PHOSPHORYLATION"
[2] "REACTOME_AXON_GUIDANCE"
[3] "REACTOME_DIABETES_PATHWAYS"
[4] "REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING"
[5] "REACTOME_FORMATION_OF_PLATELET_PLUG"
[6] "REACTOME_FURTHER_PLATELET_RELEASATE"
[7] "REACTOME_G_ALPHA_S_SIGNALLING_EVENTS"
[8] "REACTOME_G_PROTEIN_BETA_GAMMA_SIGNALLING"
[9] "REACTOME_HEMOSTASIS"
[10] "REACTOME_NCAM_SIGNALING_FOR_NEURITE_OUT_GROWTH"
[11] "REACTOME_NCAM1_INTERACTIONS"
[12] "REACTOME_PLATELET_ACTIVATION"
[13] "REACTOME_PLATELET_DEGRANULATION"
[14] "REACTOME_REGULATION_OF_INSULIN_SECRETION"
[15] "REACTOME_SIGNALING_BY_PDGF"
[16] "REACTOME_ADP_SIGNALLING_THROUGH_P2Y_PURINOCEPTOR_12"
[17] "REACTOME_ACTIVATION_OF_KAINATE_RECEPTORS_UPON_GLUTAMATE_BINDING"
[18] "REACTOME_G_BETA_GAMMA_SIGNALLING_THROUGH_PLC_BETA"
[19] "REACTOME_G_BETA_GAMMA_SIGNALLING_THROUGH_PI3KGAMMA"
[20] "REACTOME_INHIBITION_OF_INSULIN_SECRETION_BY_ADRENALINE_NORADRENALINE"
[21] "REACTOME_SMOOTH_MUSCLE_CONTRACTION"
[22] "BIOCARTA_SALMONELLA_PATHWAY"
In the tumor-stages analysis there was not enough statistical power to detect DE genes in the different tumorogenic stages (I-IV), previosly to the gene set enrichment. However, after applying GSEA to the data we could see some relevant pathways that may differe between those stages.
Although many genes have been associated with the increased risk of CRC, the genetic differences across different stages of CRC have not been clearly identified Lorenc et al. (2017). Some genes have been reported to be potentially associated with higher stages of the cancer, as NEK4, RNF34 (implied in senescence and apoptosis) and NUDT6 (control signaling compounds and degradates potentially mutagenic oxidized nucleotides), which are expected to be in low concentration for higher stage of the disease.
For example, among the top 20 patways ranked according to the Z-score test, there are included those related to Homeostasis, the signaling pathway of PDGF, and the PI3K pathway. PDGF has been found to an important growth factor for normal tissue growth and division Manzat Saplacan et al. (2017), and also corelated with CRC invasion and metastasis when deregulated. Regarding to PI3K pathway, has also been related with loss of Adenomatous Polyposis Coli (APC), commented to be potentially important in CRC development.
However, we still miss some information that may be the reason for a non statistical reliable results. In our tumor stages study, some limitations are found: we only use TCGA CRC data on samples from cancer patients, and thus the analysis was only performed on these samples (difficulty to have a control); many field of the TCGA data were missing (NAs), which was not possible to be evaluated, and may be alterating gene expression in some of the genes of interest.
Finally, the last test we decided to perform to try to identy visually some kind of differences between the different stages was a boxplot of the different logCPM values between stages in those genes with highest logFC values.
boxplotgenes <- function(se, gene) {
iterations = dim(se)[2]
variables = 2
output <- matrix(ncol=variables, nrow=iterations)
output <- data.frame(output)
colnames(output) <- c("stage", "logCPM")
aa <- se[rowData(se)$symbol == gene]$ajcc_pathologic_tumor_stage
bb<-assays(se[rowData(se)$symbol == gene])$logCPM
for(i in 1:iterations){
output$stage[i] <- aa[i]
output$logCPM[i] <- bb[i]
}
output$stage<-gsub(x = output$stage, pattern = "1", replacement = "Stage I")
output$stage<-gsub(x = output$stage, pattern = "2", replacement = "Stage II")
output$stage<-gsub(x = output$stage, pattern = "3", replacement = "Stage III")
output$stage<-gsub(x = output$stage, pattern = "4", replacement = "Stage IV")
boxplot(logCPM ~ stage, data=output, col=c("grey","red", "pink", "cyan", main= gene , ylab="logCPM"))
}
all_DEgenes <- tt
all_DEgenes_sorted <- all_DEgenes[order(all_DEgenes$adj.P.Val),]
genes<-rownames(all_DEgenes_sorted)[13]
gene<-dge.tumor$genes$symbol[rownames(dge.tumor$genes) == genes]
boxplotgenes(coadse.tumor, gene)
Figure 20: Boxplot of RIPK3
Finally, in the figure 20 we have represented the different logCPM values among stages of RIPK3 gene, the thirteenth gene with a higher logFC value, and a gene that has been suggested as a potential predictive and prognostic marker in metastatic colon cancer Conev et al. (2019). Looking at the plot, we can not see important differences among stages but a subtle tendency of increasing the logCPM values stage to stage.
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] it_IT.UTF-8/it_IT.UTF-8/it_IT.UTF-8/C/it_IT.UTF-8/it_IT.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GOstats_2.48.0 ggplot2_3.1.1
[3] calibrate_1.7.2 MASS_7.3-51.4
[5] Category_2.48.1 Matrix_1.2-17
[7] GSVA_1.30.0 GSVAdata_1.18.0
[9] hgu95a.db_3.2.3 GSEABase_1.44.0
[11] graph_1.60.0 org.Hs.eg.db_3.7.0
[13] xtable_1.8-4 geneplotter_1.60.0
[15] annotate_1.60.1 XML_3.98-1.19
[17] AnnotationDbi_1.44.0 lattice_0.20-38
[19] edgeR_3.24.3 limma_3.38.3
[21] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
[23] BiocParallel_1.16.6 matrixStats_0.54.0
[25] Biobase_2.42.0 GenomicRanges_1.34.0
[27] GenomeInfoDb_1.18.2 IRanges_2.16.0
[29] S4Vectors_0.20.1 BiocGenerics_0.28.0
[31] knitr_1.22 BiocStyle_2.10.0
loaded via a namespace (and not attached):
[1] bitops_1.0-6 bit64_0.9-7 RColorBrewer_1.1-2
[4] Rgraphviz_2.26.0 tools_3.5.3 R6_2.4.0
[7] DBI_1.0.0 lazyeval_0.2.2 colorspace_1.4-1
[10] withr_2.1.2 tidyselect_0.2.5 bit_1.1-14
[13] compiler_3.5.3 bookdown_0.9 scales_1.0.0
[16] genefilter_1.64.0 RBGL_1.58.2 stringr_1.4.0
[19] digest_0.6.18 rmarkdown_1.12 AnnotationForge_1.24.0
[22] XVector_0.22.0 pkgconfig_2.0.2 htmltools_0.3.6
[25] highr_0.8 rlang_0.3.4 RSQLite_2.1.1
[28] shiny_1.3.2 dplyr_0.8.1 RCurl_1.95-4.12
[31] magrittr_1.5 GO.db_3.7.0 GenomeInfoDbData_1.2.0
[34] Rcpp_1.0.1 munsell_0.5.0 stringi_1.4.3
[37] yaml_2.2.0 zlibbioc_1.28.0 plyr_1.8.4
[40] grid_3.5.3 blob_1.1.1 promises_1.0.1
[43] crayon_1.3.4 splines_3.5.3 locfit_1.5-9.1
[46] pillar_1.3.1 codetools_0.2-16 glue_1.3.1
[49] evaluate_0.13 BiocManager_1.30.4 httpuv_1.5.1
[52] gtable_0.3.0 purrr_0.3.2 assertthat_0.2.1
[55] xfun_0.6 mime_0.6 later_0.8.0
[58] survival_2.44-1.1 tibble_2.1.1 shinythemes_1.1.2
[61] memoise_1.1.0
Arese, Marco, Federico Bussolino, Margherita Pergolizzi, Laura Bizzozero, and Davide Pascal. 2018. “Tumor Progression: The Neuronal Input.” Annals of Translational Medicine 6 (5).
Conev, Nikolay V, Eleonora G Dimitrova, Margarita K Bogdanova, Yavor K Kashlov, Borislav G Chaushev, Dilyan P Petrov, Chavdar H Bachvarov, et al. 2019. “RIPK3 Expression as a Potential Predictive and Prognostic Marker in Metastatic Colon Cancer.” Clinical and Investigative Medicine (Online) 42 (1): E31–E38.
Di Giorgio, Eros, Wayne W Hancock, and Claudio Brancolini. 2018. “MEF2 and the Tumorigenic Process, Hic Sunt Leones.” Biochimica et Biophysica Acta (BBA)-Reviews on Cancer 1870 (2): 261–73.
Fearon, Eric R. 2011. “Molecular Genetics of Colorectal Cancer.” Annual Review of Pathology: Mechanisms of Disease 6: 479–507.
Hsu, Stephen, Fei Huang, and Eileen Friedman. 1995. “Platelet-Derived Growth Factor-B Increases Colon Cancer Cell Growth in Vivo by a Paracrine Effect.” Journal of Cellular Physiology 165 (2): 239–45.
Lorenc, Zbigniew, Dariusz Waniczek, Katarzyna Lorenc-Podgórska, Wiktor Krawczyk, Maciej Domagała, Mateusz Majewski, and Urszula Mazurek. 2017. “Profile of Expression of Genes Encoding Matrix Metallopeptidase 9 (Mmp9), Matrix Metallopeptidase 28 (Mmp28) and Timp Metallopeptidase Inhibitor 1 (Timp1) in Colorectal Cancer: Assessment of the Role in Diagnosis and Prognostication.” Medical Science Monitor: International Medical Journal of Experimental and Clinical Research 23: 1305.
Manzat Saplacan, Roberta M, Loredana Balacescu, Claudia Gherman, Romeo I Chira, Anca Craiu, Petru A Mircea, Cosmin Lisencu, and Ovidiu Balacescu. 2017. “The Role of Pdgfs and Pdgfrs in Colorectal Cancer.” Mediators of Inflammation 2017.
Qiu, Xia, Jianguo Jiao, Yidong Li, and Tian Tian. 2016. “Overexpression of Fzd7 Promotes Glioma Cell Proliferation by Upregulating Taz.” Oncotarget 7 (52): 85987.
Radin, Daniel P, and Parth Patel. 2017. “A Current Perspective on the Oncopreventive and Oncolytic Properties of Selective Serotonin Reuptake Inhibitors.” Biomedicine & Pharmacotherapy 87: 636–39.
Rahman, Mumtahena, Laurie K Jackson, W Evan Johnson, Dean Y Li, Andrea H Bild, and Stephen R Piccolo. 2015. “Alternative Preprocessing of Rna-Sequencing Data in the Cancer Genome Atlas Leads to Improved Analysis Results.” Bioinformatics 31 (22): 3666–72.
Rebecca L. Siegel MPH, PhD, Kimberly D. Miller MPH Ahmedin Jemal DVM. 2015. “Cancer Statistics, 2015.” CA: A Cancer Journal for Clinicians 65 (1): 5–29.
Tomonaga, Takeshi, Kazuyuki Matsushita, Seiko Yamaguchi, Tatsuya Oohashi, Hideaki Shimada, Takenori Ochiai, Kinya Yoda, and Fumio Nomura. 2003. “Overexpression and Mistargeting of Centromere Protein-a in Human Primary Colorectal Cancer.” Cancer Research 63 (13): 3511–6.